The logarithmic number system (LNS) has recently been
shown to offer a viable alternative to traditional computer
arithmetic number systems when a large dynamic range and
high precision are required. This dissertation is part of a
recent effort by the scientific community to further
investigate and enhance the qualities of LNS. It broadens
the spectrum of available LNS arithmetic operations,
optimizes its design parameters, enhances memory reduction
techniques and alleviates problems related to the short
wordlength of currently existing high-speed memory tables.
Two techniques, based on sequential and random memory
access principles, are investigated through stochastic
analysis and dynamic programming, with exciting results for
at least one of the designs. The impact of this design on
the processor architecture is analyzed, and its
consequences for better handling of fault tolerance and
recovery through reconfiguration are examined. An attempt
is made to systematize the development of applications in
order to fully exploit the LNS attributes, and the
theoretical operations count bound is approached. Finally,
a series of Digital Signal Processing and other
applications, where the conventional systems present
drawbacks, is considered and appealing LNS solutions are
offered. It can be claimed that the results of this
research make feasible the VLSI realization of a 20-bit LNS
processor.

CHAPTER ONE
INTRODUCTION 


Using  traditional  or  modified  von  Neumann 
architectures,  attached  array  processors  or  Digital  Signal 
Processing  (DSP)  chips,  DSP  has  undergone  a decade  of 
rapid  development  and  enjoyed  many  successes.  DSP 
scientists  and  technologists  have  been  accustomed,  over 
this  period,  to  responding  to  the  requirements  for 
increased  performance.  However,  there  is  a list  of 
problems  (multidimensional  or  real  time),  which  often 
require  processing  power  that  the  current  generation 
machinery  can  only  provide  "off-line."  Other  problems, 
being  adaptive,  require  a dynamic  reconfiguration  of  the 
system  during  run-time.  Others  may  require  multitasking, 
which  precludes  the  use  of  highly  tuned  unitask  DSP 
architectures  (e.g.  systolic  arrays).  Other  issues  of 
major  importance  are  the  ones  of  fault  tolerance  and 
recovery.  The  above  list  of  advanced  requirements  creates 
an  imperative  need  for  faster  and  more  intelligent  and 
accurate  systems. 

Recent developments in the area of the Logarithmic
Number System (LNS) have shown it to offer some
significant advantages when compared to other number
systems. The conventional weighted binary systems, whether
all-fractional, all-integer or floating-point (FLP),
suffer from relative slowness or high circuit complexity
for operations like multiplication or division. On the
other hand, modular number systems, like the Residue
Number System (RNS), which is very fast because no carry
propagation is needed, face difficulties when dealing
with division, overflow detection and magnitude
comparison. The main advantages of LNS are the extended
dynamic range on which it operates and the high speed of
operations, accompanied by a remarkably regular data flow.
The LNS, though, while exhibiting many positive
attributes, has not enjoyed the attention given to other
systems. Therefore, a number of fundamental issues remain
unsolved at this time. For the LNS to be fast enough to
justify its use, it has to be memory intensive whenever it
deals with add/subtract operations. The high-speed memory
chips, though, are technologically restricted to a limited
wordlength of ~12 bits. Methods for increasing the
wordlength are needed for the LNS to be sufficiently
accurate. Developing such methods is one of the tasks
accomplished in this dissertation. Two advanced LNS
processors are examined: the Adaptive Radix Processor
(ARP) and the Associative Memory Processor (AMP). The
former, which partitions the dynamic range, is a "random
memory access" approach to the problem of enhancing the
precision of LNS processors, while the latter is a
"sequential access" one. The two processors were
researched and analyzed. A statistical analysis of the
error budgets was performed using stochastic analysis and
dynamic programming techniques. The addressing space was
optimized using table reduction techniques. The two
processors were compared with regard to latency and the
total number of storage bits required to obtain the same
precision. The sensitivity of the AMP to the location of
the radix was also investigated. Alternative number
systems were considered for integration with the better of
the two designs (ARP) to achieve even better results.

Floating-point  number  representation  is  the 
dominating  choice  of  systems  designers,  whenever  extended 
dynamic  range  and  high  precision  are  simultaneously 
required.  Due  to  the  importance  of  this  system,  a full- 
scale  statistical  analysis  of  the  error  involved  in  the 
conversion  from  FLP  to  LNS  was  conducted. 

Since one of the major advantages of LNS over other
number systems is the increased speed at which
multiplications and divisions are performed, algorithms
approaching the lower bound on the number of additions and
subtractions were targeted. The current forms of useful DSP
operations, like FFTs, convolutions, correlations, etc.,
were checked against these theoretical bounds.

Specific applications of LNS were also examined, with
an effort to systematize the whole procedure for the
development of applications on a special-purpose LNS
processor, exploiting the findings of the present research
for optimal outcomes. Some other applications with unique
and appealing LNS solutions were considered as well.

The architectural consequences of the new processor
designs were also investigated. The regularity of the data
flow makes it possible to envision 3-D systems where,
besides communication in a traditional plane (card level),
multiple and regular communication paths are opened in the
third dimension (intracard level).

The successful accomplishment of the above research
tasks resulted in a fully theoretically explored and
experimentally tested LNS processor, capable of operating
at the speed of $10^{8}$ - $10^{9}$ floating-point-like
operations per second (FLOPS) over an extended dynamic
range and with sufficient precision to serve as a
special-purpose processor in a number of applications where
real-time processing is required. A highly desirable
processor evolved in terms of throughput, cost, complexity
and flexibility.

The organization of the dissertation is as follows.
Chapter Two contains a comprehensive survey of the
existing literature. This survey is organized along
functional lines and broadly comprises a description of
the literature on LNS arithmetic and applications.
Chapter Three presents the LNS number representation
followed throughout this work, along with algorithms for
basic arithmetic operations, both previously existing and
developed here. Chapter Four deals with the conversion
problem, and Chapter Five summarizes existing memory
reduction techniques and suggests new ones along with
memory optimization procedures. Chapter Six describes the
ARP and analyzes its statistics. It also suggests
alternative hybrid signed-digit LNS designs. In Chapter
Seven the AMP design is analyzed for optimal results.
Existing theoretical bounds on the operation count for
basic DSP algorithms are given in Chapter Eight, together
with several techniques for optimizing the
multiplication/addition count tradeoff for LNS. In Chapter
Nine a systematic application development procedure is
presented. It is accompanied by applications of LNS in
areas where conventional systems present drawbacks, or
where the specific designs present interesting features.
The impact of the dominant LNS design (ARP) on the
architecture and on the issues of fault tolerance and
recovery is investigated in Chapter Ten. In Chapter
Eleven, by way of summary, the significance of the
described research is brought out, and some directions for
future research are indicated. Finally, some of the
arithmetic algorithms are derived in Appendix A, and
listings of the C-language programs used to simulate and
experimentally verify various theoretical conjectures are
given in the remaining Appendices (B - O).


CHAPTER  TWO 
SURVEY  OF  LITERATURE 


As is usually the case with basic research, the theory
of Logarithmic Number Systems (LNS) remained virtually
unused until advances in semiconductor memory technology
offered the vehicle for actual implementation of
sufficiently efficient LNS engines. Work related to LNS
(mainly conversion routines) dates back to Briggs three
centuries ago [Sal54]. In 1971 Kingsbury and Rayner
[Kin71] first outlined logic hardware and software
function approximations to the addition and subtraction of
two logarithmically encoded positive numbers. Swartzlander
and Alexopoulos [Swa75] followed with an analysis of
ROM-based hardware for faster addition, but with a
wordlength limitation of 12 bits. Later Lee and Edgar
[Lee77a-b] developed computationally efficient supporting
algorithms for microcomputer control and signal processing
applications to be implemented using 8- and 16-bit
logarithmic arithmetic on microprocessors. They
established that, unlike floating-point arithmetic, which
provides accuracy at the expense of speed, the LNS
provides both and is suitable for Digital Signal
Processing (DSP) applications. Kurokawa et al. [Kur80]
applied LNS to the implementation of digital filters and
demonstrated that it gives filtering performance superior
to that of a floating-point system of equivalent
wordlength and range. A similar observation was made by
Swartzlander et al. [Swa83] when using LNS for a Fast
Fourier Transform processor. Shenoy [She83] applied LNS
to Least Mean Squared adaptive digital filters to achieve
increased performance when compared to fixed-point
realizations. These studies showed that all the basic
arithmetic operations are very fast and easy to implement
in LNS. The technological advances in memory construction
have renewed interest in LNS research over the last
few years. A brief survey of the LNS literature
will now be presented.

2.1 Logarithmic Conversion

Cantor  et  al.  [Can62]  have  described  a special- 
purpose  structure  to  implement  sequential  table  look-up 
(STL)  algorithms  for  the  evaluation  of  logarithms.  Tables 
of  precomputed  constants  are  used  to  transform  the 
argument  into  a range,  where  the  function  may  be 
approximated by a simple polynomial. Their work was based
on transformation algorithms previously developed by Bemer
[Bem58]. Later, Specker [Spe65] offered alternative STL
algorithms requiring a smaller number of constants,
which could either be wired into a computer structure or
programmed. For an R-bit word format, a basic hardware
configuration required R additions and R/2 shift
operations. Mitchell [Mit62] proposed a method of computer
multiplication and division using binary logarithms and
performed  an  analysis  to  determine  the  maximum  errors  that 
may  occur  as  a result  of  the  approximation  used.  He 
determined  the  binary  logarithm  of  a binary  number  in 
hardware  from  the  number  itself,  by  using  simple  logical 
and  shift  operations  after  encoding  the  binary  number  into 
a form,  from  which  the  characteristic  is  easily  determined 
and the mantissa easily approximated. This work was later
expanded by Combet et al. [Com65]. They partitioned the
range  of  the  binary  numbers  into  four  parts  and  again  made 
a piecewise  linear  approximation.  The  linear  equations, 
which  they  used,  were  found  by  trial  and  error,  using  a 
criterion  of  minimum  error,  and  constraining  the 
coefficients  to  be  easily  implemented  with  binary 
circuitry. That is, the coefficients were chosen to be
fractions with integer numerators and power-of-two
denominators. With a four-subinterval partition, the
single division error was reduced by a factor of six. In
Hall  et  al.  [Hal70]  the  authors  discuss  algorithms  for 
computing  approximate  binary  logarithms,  antilogarithms 
and  applications  to  digital  filtering  computations.  They 
define  the  coefficients  for  a linear  least  square  fit  of 
the  binary  logarithm  of  the  fractional  number  incremented 
by  one,  partitioning  again  the  range  of  numbers  into  four 
subintervals. The maximum error with these coefficients
is reduced by a factor of roughly 1.3, but two additional
sums are required. Marino [Mar72] presents a parabolic
approximation method for the generation of binary
logarithms, with a precision increased over Hall et al.
[Hal70] by a factor of 2.5, without increasing the number
of sums. This method can be implemented as a subroutine or
with a hardware peripheral device to generate, with almost
the same precision, approximations of many useful
elementary functions. Kingsbury and Rayner [Kin71] propose
the use of linear analog-to-digital and digital-to-analog
converters in the case of digital filtering employing LNS.
Conversion  of  digital  logarithms  to  analog  voltages  can  be 
achieved  by  supplying  an  amplifier  chain  from  a reference 
voltage  and  causing  each  bit  of  the  logarithm  to  switch 
the  overall  gain  by  a factor  depending  on  the  significance 
of  the  bit.  A/D  conversion  is  achieved  by  using  a high- 
gain  comparator  with  the  D/A  converter  to  approximate 
successively  the  converter  output  to  the  signal  input. 
The  use  of  an  interesting  RC  circuit  to  implement  the 
logarithmic  Analog  to  Digital  conversion  is  proposed  by 
Duke  [Duk71].  Chen  [Che72]  presents  unified  methods  for 
evaluation  of  exponentials,  logarithms,  ratios  and  square 
roots  for  fractional  arguments,  in  one  conventional 
multiply  time,  employing  only  shifts,  adds,  high-speed 
table look-ups and bit counting. Briggs used decimal
digit-by-digit schemes [Sal54] for evaluating logarithms.
Meggitt's [Meg62] unified approach was based on Briggs'
scheme, which was further improved by Sarkar and
Krishnamurthy [Sar71]. Walther [Wal71] showed another
unifying algorithm containing the CORDIC schemes proposed
by Volder [Vol59]. The elementary functions he dealt with
are derived from hyperbolic functions by repeatedly
invoking a basic shift sequence. A method suggested by
de Lugish [Lug70] can be quite advantageous when the add
time is the overriding cost factor. Very recent methods
for logarithm generation include the sequential squaring
method by Karp [Kar84], useful for large-scale vector
processors and limited-precision microprocessors, and the
Difference Grouping Programmable Logic Array (DGPLA)
method discussed by Lo and Aoki [Lo85]. The latter is
based  on  the  work  of  Brubaker  and  Becker  [Bru75],  which 
utilizes  read-only  memories  (ROMs)  for  the  generation  of 
the  logarithm.  For  a predetermined  error,  the  number  of 
bits  can  be  chosen  for  transformations;  otherwise,  for  a 
predetermined  number  of  bits,  the  error  can  be  set  to  an 
optimal  minimum  value  by  the  adjustment  between  the  upper 
and  lower  transformations.  The  primary  drawback  arises 
from  the  large  number  of  bits  required  in  the  ROM,  if  one 
needs high precision of calculation. Furthermore, the input
combinational product terms of a ROM cannot be simplified.
Lo and Aoki proposed the use of DGPLAs to avoid these
shortcomings. They offered an algorithm to synthesize the
DGPLAs by reducing the calculation of transformation, the
location of the break points, and the ranges covered by
the segments to an optimal condition. The same method can
also be applied to generating antilogarithms, though with
less favorable results.
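
The straight-line approximation attributed to Mitchell [Mit62] above can be illustrated with a short sketch. It assumes the usual reading of the method: a positive integer n is written as 2^k (1 + m) with 0 <= m < 1, the characteristic k is found from the position of the leading one, and lg(n) is approximated by k + m. The function names `mitchell_log2` and `mitchell_multiply` are hypothetical and chosen only for this illustration.

```python
import math


def mitchell_log2(n: int) -> float:
    """Approximate lg(n) as k + m, where n = 2**k * (1 + m) and 0 <= m < 1."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    k = n.bit_length() - 1          # characteristic: index of the leading one bit
    m = (n - (1 << k)) / (1 << k)   # mantissa: remaining bits read as a fraction
    return k + m


def mitchell_multiply(a: int, b: int) -> float:
    """Approximate a * b by adding approximate logarithms, then an approximate antilog."""
    s = mitchell_log2(a) + mitchell_log2(b)
    k = math.floor(s)
    return (1 << k) * (1 + (s - k))  # antilog uses the same straight-line rule


# Largest underestimate of lg(n) over all 12-bit positive integers.
worst = max(math.log2(n) - mitchell_log2(n) for n in range(1, 1 << 12))
```

Powers of two are handled exactly, while an argument such as 12 = 8 x 1.5 is mapped to 3.5 instead of roughly 3.585. The piecewise-linear refinements surveyed above aim at reducing this kind of error.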


2.2 Logarithmic Arithmetic
2.2.1  Multiplication  and  Division 

An  algorithm  for  computer  multiplication  (division) 
using  binary  logarithms  was  described  by  Mitchell  [Mit62]. 
A simple  add  (subtract)  and  shift  operation  was  required. 
Hall  et  al.  [Hal70]  offered  a refinement  of  this  method, 
accompanied  by  an  exhaustive  error  analysis  of  the  product 
and  quotient  errors.  A different  approach  was  followed  by 
Brubaker  and  Becker  [Bru75].  They  developed  design  curves 
for  ROMs  needed  to  generate  the  logarithm  and 
antilogarithm  transformations  necessary  for 
multiplication.  The  number  of  bits  for  a given  accuracy 
was  shown  to  be  less  than  the  one  required  for  a direct 
multiply.  A direct  ROM-based  multiplication  requires  one 
memory  access  while  logarithmic  multiplication  requires 
two  memory  access  times  plus  an  addition.  Similar 
algorithms  along  with  examples  and  hardware 
implementations  may  be  found  in  one  or  more  of  the 
following references: [Kin71, Li80, Lee77a, Lee77b, Lee79,
Maj73, Swa75, Swa79, Tay84b, Tay85].
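
As a minimal sketch of why multiplication and division are inexpensive once numbers are held as logarithms, the fragment below stores a sign flag together with the base-2 logarithm of the magnitude. The type `LnsNumber` and the helper names are assumptions made for this illustration, and the exponent is kept as a Python float rather than in the fixed-point word an actual LNS processor would use.

```python
import math
from typing import NamedTuple


class LnsNumber(NamedTuple):
    negative: bool    # sign of the represented real number
    exponent: float   # lg|X|; a hardware design would hold this in fixed point


def to_lns(value: float) -> LnsNumber:
    if value == 0:
        raise ValueError("zero has no logarithm and needs separate handling")
    return LnsNumber(value < 0, math.log2(abs(value)))


def from_lns(n: LnsNumber) -> float:
    magnitude = 2.0 ** n.exponent
    return -magnitude if n.negative else magnitude


def lns_mul(a: LnsNumber, b: LnsNumber) -> LnsNumber:
    # Product: XOR the signs, add the exponents.
    return LnsNumber(a.negative != b.negative, a.exponent + b.exponent)


def lns_div(a: LnsNumber, b: LnsNumber) -> LnsNumber:
    # Quotient: XOR the signs, subtract the exponents.
    return LnsNumber(a.negative != b.negative, a.exponent - b.exponent)
```

For instance, `from_lns(lns_mul(to_lns(-8.0), to_lns(4.0)))` recovers a value close to -32, with only an exponent addition and a sign comparison in between the two conversions.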
2.2.2 Addition and Subtraction

Kingsbury  and  Rayner  [Kin71]  pioneered  in  proposing 
two  methods  of  adding  or  subtracting  logarithmically 
encoded  numbers.  In  the  direct  method  of  addition  and 
subtraction,  they  use  approximate  evaluation  of 
algorithms.  However,  their  second  method  is  based  on  table 
look-ups  using  read-only  memory  (ROM),  which  leads  to 
potentially  faster  operations.  The  paper  also  indicates 
possible methods of table reduction. In 1975, Swartzlander
and Alexopoulos [Swa75], unaware of previous work in the
area, proposed a sign/logarithm number system along with
arithmetic algorithms identical to those proposed by
Kingsbury and Rayner [Kin71]. Their paper suggests
hardware architectures for arithmetic units and includes
a comparison of speeds with conventional arithmetic units.
LNS was reinvented a third time in a paper by Lee and
Edgar [Lee77a], but enriched with ideas and programs for
implementation in 8- and 16-bit microcomputers. They
examined  more  rigorously  the  LNS  addition  and  subtraction 
algorithms  with  respect  to  storage,  efficiency,  accuracy, 
range,  noise  and  array  area  [Lee79].  In  a continued  effort 
to  research  efficient  algorithms  in  LNS,  Li  [Li80] 
presents  four  algorithms  for  addition  and  subtraction. 
This  report  compares  table  look-up  and  direct  methods  with 
interpolation  and  power  series  algorithms  with  respect  to 
memory storage, speed and accuracy. Also covered in this
work is a comparative study of floating-point and
logarithmic hardware.
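
The table look-up approach to addition mentioned above can be sketched as follows. The sketch relies on the standard identity lg(A + B) = a + lg(1 + 2^(b - a)) for a = lg A >= b = lg B, so that one table indexed by the exponent difference suffices. The choice of 8 fractional bits and the names `ADD_TABLE`, `quantize` and `lns_add` are assumptions for illustration only, not the designs analyzed in the cited papers.

```python
import math

F = 8              # fractional bits of the exponent (illustrative choice)
SCALE = 1 << F


def quantize(x: float) -> int:
    """Round a real exponent to a fixed-point integer with F fractional bits."""
    return round(x * SCALE)


# Addition table: entry d holds lg(1 + 2**(-d / SCALE)) in fixed point.
# The table stops at the first difference whose entry rounds to zero.
ADD_TABLE = []
d = 0
while True:
    entry = quantize(math.log2(1.0 + 2.0 ** (-d / SCALE)))
    if entry == 0:
        break
    ADD_TABLE.append(entry)
    d += 1


def lns_add(a: int, b: int) -> int:
    """Exponent of A + B for positive A, B given as fixed-point exponents a, b."""
    hi, lo = max(a, b), min(a, b)
    diff = hi - lo
    correction = ADD_TABLE[diff] if diff < len(ADD_TABLE) else 0
    return hi + correction
```

Adding the exponents of 3 and 5 this way yields a fixed-point exponent within a couple of least significant bits of lg 8 = 3. Subtraction follows the same pattern with a table of lg(1 - 2^(-d)), which is singular at d = 0 and is therefore harder to tabulate.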

2.2.3  Table  Reduction  Techniques 

The  most  recent  algorithms  for  implementation  of 
addition  or  subtraction  in  a logarithmic  environment  call 
for  table  look-ups.  The  speed  of  operations  then  depends 
on  the  size  of  the  table  which  is  a function  of  the 
wordlength.  A brute  force  table  look-up  is  hindered  by  the 
monotonically  decreasing  nature  of  the  logarithmic 
addition  and  the  monotonically  increasing  nature  of 
subtraction. An 8-bit implementation, for example, would
require 2 Kbits of memory. Pioneering in this field too,
Kingsbury  and  Rayner  [Kin71]  suggested  two  methods  of 
reducing  the  storage  requirements.  One  is  to  store  in  the 
table  the  values  of  the  function  corresponding  to  every 
8th  (say)  value  of  the  input  and  to  interpolate  linearly 
between these points. The other method is to store x at
the address f(x) instead of f(x) at x, and then employ a
read-and-compare successive approximation process. The
current  state  of  the  art  in  designing  high-speed  ROMs  or 
RAMs  is  such  that  LNS  is  useful  for  high-speed  operations 
only  when  the  wordlength  is  limited  to  12  or  13  bits.  The 
implementation  of  table  look-up  in  hardware  has  been  much 
delayed due to the prohibitive cost of memory. The
realization of a custom-designed VLSI chip, though,
makes practical tables that take advantage of the
properties of the function to be realized. A
parallel-search table used for a pilot VLSI implementation
of addition and subtraction algorithms is described by
Bechtolsheim and Gross [Bec81]. The parallel-search table
stores pairs of (x, y) values and searches the set of
x-values for a matching interval. By searching
digit-sequentially, instead of word-parallel, the look-up
time of a parallel-search table is only proportional to the
length of the x-values stored. Also since a read-only
parallel-search  table  has  single-transistor  bit  cells,  it 
can  be  built  as  densely  as  a conventional  read-only 
memory.  As  a result  of  discrete  coding  and  the  nature  of 
the  functions,  a large  portion  of  the  output  values  is 
zero.  Lee  and  Edgar  [Lee79]  determine  a cutoff  value, 
after  which  the  input  is  mapped  to  a zero  output  value. 
The  number  of  zeros  depends  on  the  number  of  fractional 
bits  allotted  in  the  word  format  and  the  chosen  radix  of 
the number system. Frey and Taylor [Fre85] suggest an
interesting algorithm to reduce the size of the table, by
recognizing the fact that the discrete encoding results in
a "staircase"-type function, wherein a range of addresses
will return the same output value. Taylor [Tay83a]
proposes  a linear  interpolation  scheme,  which  is  practical 
from  a hardware  realization  standpoint. 
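
Two of the reduction ideas surveyed above lend themselves to a small numerical sketch: the cutoff address beyond which the rounded addition-table output is zero [Lee79], and storing only every 8th sample with linear interpolation in between [Kin71]. The code assumes the base-2 addition function lg(1 + 2^(-d)); the function names and the 8-fractional-bit setting are hypothetical choices for illustration.

```python
import math


def phi(d: float) -> float:
    """Addition function lg(1 + 2**-d) for a non-negative exponent difference d."""
    return math.log2(1.0 + 2.0 ** (-d))


def cutoff_entries(frac_bits: int) -> int:
    """Count the leading table addresses whose rounded output is nonzero."""
    scale = 1 << frac_bits
    n = 0
    while round(phi(n / scale) * scale) != 0:
        n += 1
    return n


def interpolated_phi(d: float, frac_bits: int, stride: int = 8) -> float:
    """Linear interpolation between samples stored every `stride` addresses."""
    step = stride / (1 << frac_bits)      # spacing of the stored samples
    x0 = (d // step) * step               # nearest stored sample at or below d
    y0, y1 = phi(x0), phi(x0 + step)
    return y0 + (y1 - y0) * (d - x0) / step


# Worst interpolation error on a grid of differences in [0, 9).
worst_error = max(
    abs(interpolated_phi(k / 64, 8) - phi(k / 64)) for k in range(9 * 64)
)
```

With 8 fractional bits the cutoff lands near a difference of 9.5, so all larger addresses can share a single zero output, and on the sampled grid the interpolation error stays below half of the least significant fractional bit.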
2.3 LNS Applications
2.3.1  Digital  Filtering  using  LNS 

The first (to my knowledge) application of LNS (other
than the slide rule) was a digital period-meter for a
nuclear reactor designed by Eder et al. [Fur64].
Kingsbury  and  Rayner  [Kin71]  implemented  a 2nd  order 
recursive  low-pass  filter  with  16-bit  logarithmic 
arithmetic  and  demonstrated  the  improvement  in  dynamic 
range  and  performance  over  a 16-bit  fixed-point  filter. 
Kurokawa  [Kur78]  applied  logarithmic  arithmetic  when 
implementing  digital  filters.  Sicuranza  offers  some 
preliminary  results  on  2-D  filter  designs  implemented  with 
LNS,  which  confirm  that  it  is  possible  to  design,  with 
sufficient  accuracy,  digital  filters  having  their 
coefficients expressed as powers of some base a, and thus
exploit the advantages offered by LNS [Sic81a-b, Sic82].
Finally  [Sic83],  he  reports  some  design  considerations 
based  on  the  so-called  "Implementation  Cost,"  measured  by 
the  total  number  of  bits  required,  and  presents  a few 
significant  examples  of  2-D  filter  approximation, 
including  a 2-D  wide-band  differentiator,  a circularly 
symmetrical  half-plane  low-pass  filter  and  a 90°  fan 
filter. In the same paper he offers an iterative base
optimization procedure, shown to influence the
convergence rate of the 2-D filters. Hall et al. [Hal70]
apply LNS to a recursive digital filter, which fails to
compare favorably to an alternative design incorporating a
cobweb array multiplier. However, a favorable
comparison is achieved in the case of a parallel digital
filter bank with more than four subfilters. Finally they
apply LNS to the multiplicative filters proposed by
Oppenheim et al. [Opp68].

2.3.2  FFT  Implementation 

Swartzlander et al. [Swa83] presented a decimation-in-time
sign/logarithm FFT. This is accompanied by an FFT
error  analysis  and  computer  simulation  results.  It  is 
shown  that,  besides  being  faster,  LNS  offers  an  improved 
error  performance,  when  compared  to  conventional  fixed  or 
floating-point  arithmetic. 

2.3.3  Other  LNS  Applications 

Swartzlander  and  Gilbert  [Swa80a]  analyzed  the 
requirements  for  the  new  generation  computed  tomography 
machines  and  compared  LNS  convolution  and  weighted  linear 
summation  units  to  similar  units  implemented  with  merged 
arithmetic  [Swa80b]  and  a two's  complement  modular  array. 
They  used  a figure  of  merit  relating  processing  speed  to 
complexity  and  demonstrated  that  the  LNS  approach,  which 
achieves  extremely  high  dynamic  range  by  use  of  constant 
relative  precision,  is  the  most  efficient  mechanization  of 
the three examined. Kurokawa et al. [Kur80] applied LNS to
recursive digital filters and performed an extensive error
analysis of the roundoff error accumulation, based on the
assumption that the true (unrounded) result of the
addition of two numbers is uniformly distributed between
the lower and higher limits. Shenoy and Taylor [She82]
designed a short-term autocorrelation using LNS. In a
later work [She83, She84] a theoretical error analysis of
logarithmically implemented adaptive digital filters is
presented. Simulation was used to compare them to their
fixed-point counterparts with the same wordlength; the LNS
filters performed better. The most recent effort to
design  an  LNS  engine  was  made  by  Lang  et  al.  [Lan85]. 
They  presented  integrated-circuit  logarithmic  arithmetic 
units,  including  adders,  subtracters,  multipliers  and 
dividers.  The  design  results  were  used  to  develop  a size 
and  speed  comparison  of  integrated  circuit  logarithmic  and 
fixed-point  arithmetic  units.  Interestingly  enough  they 
chose  an  LNS  base  less  than  one,  to  restrict  the  number  of 
product  terms  required  by  the  PLA  for  the  logarithmic 
adder  or  subtracter. 


CHAPTER  THREE 
LNS  AND  LNS  ARITHMETIC 

The  way  arithmetic  is  performed  using  LNS  is  heavily 
dependent  on  the  method  chosen  to  represent  the  real 
numbers  and  the  base  of  the  logarithms.  Several  number 
representations  have  been  proposed.  In  this  dissertation 
the  most  widely  adopted  representation  has  been  chosen  and 
a proof  that  the  choice  of  base  is  immaterial  is  given. 
Several basic LNS arithmetic operations are also
outlined.

3.1 LNS Description

Many different schemes have been adopted to represent
numbers in LNS. In several works [Swa75, Lee77b], a
sign/logarithm representation has been chosen, where the
LNS exponent is a fixed-point number with an integer and a
fractional part. Both the number and the exponent are
signed. The number is represented in sign magnitude and
the exponent in the offset binary number system. This way
any negative logarithms, which cause problems in further
computations, are avoided. This introduces some overhead
when multiplication and division are performed, and
therefore it is not used here.

In most of the work regarding theory and
implementation of LNS, the base of the system is selected
to be r = 2. This offers obvious advantages, but some
researchers [Lan85] have proposed bases that are between 0
and 1, guaranteeing that the LNS exponent will be
positive when the magnitude of the represented number is
bounded by 1, as is the case with signals. In general,
this requires prenormalization of the data. Others [Sic83]
have offered iterative base optimization procedures, which
positively affect the applications they are considered
for. An error criterion is established in Section
3.1.2 of this dissertation, according to which an answer
to the base optimization question is derived.
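
The remark about bases between 0 and 1 can be checked numerically. The sketch below simply evaluates the logarithm of a few magnitudes below one in base 1/2 and in base 2; the helper name `lns_exponent` and the sample values are assumptions for illustration, not data from [Lan85].

```python
import math


def lns_exponent(value: float, base: float) -> float:
    """Logarithm of |value| in the given base, i.e. the real-valued LNS exponent."""
    return math.log(abs(value), base)


samples = [0.9, 0.5, 0.01, 1e-6]          # prenormalized magnitudes below one

exps_base_half = [lns_exponent(v, 0.5) for v in samples]
exps_base_two = [lns_exponent(v, 2.0) for v in samples]
```

Every exponent in `exps_base_half` is positive, whereas the same data give negative exponents in base 2, which is why a base below one removes the need for an exponent sign when the data are prenormalized.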

3.1.1  Logarithmic  Number  System  Representation 

For a real number $X = \pm r^{\pm x}$, its LNS representation
can be given by an (N+2)-bit exponent word of the form

    | S_X | S_x | <--- I bits ---> | <--- F bits ---> |

where $S_X$ is the overall sign of the real number X and $S_x$
is the sign of the exponent. This way it is possible to
represent logarithmically even negative numbers.
Computation of $S_X$ is operation dependent, requiring
support from comparators and/or multiplexers and/or XOR
gates. Computation of $S_x$ can be performed as part of the
computation of the LNS exponent itself. Of course, if the
internal exponent representation (sign/magnitude, for
example) does not allow for automatic sign computation,
again comparators and gates have to be provided for that.
From the N weighted bits of the exponent, F bits are
assigned as fractional bits and N - F = I bits are
assigned to be the integer bits. More specifically, the
absolute value of the LNS exponent x is given by

$$|x| = \sum_{i=0}^{N-1} a_i \, 2^{\,i-F}, \qquad a_i \in \{0, 1\}$$

By distributing the appropriate number of bits between the integer
and fractional parts, both a large dynamic range and high
precision can be achieved. More specifically, this system is
characterized by

• Largest positive magnitude: $|X|_{\max} = r^{\left(2^{-F}(2^{N}-1)\right)} \approx r^{2^{I}}$

• Smallest positive magnitude: $|X|_{\min} = r^{\left(-2^{-F}(2^{N}-1)\right)} \approx r^{-2^{I}}$

• Range = Largest/Smallest: $r^{\left(2^{-F+1}(2^{N}-1)\right)} \approx r^{2^{I+1}}$

So, for example, if I = 6, F = 9, and r = 2, one finds

$5.42 \times 10^{-20} \le |X| \le 1.84 \times 10^{19}$; Range = $3.39 \times 10^{38}$
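
The figures quoted in this example can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `lns_range` and its arguments are hypothetical, and the exponent magnitude is modeled as an ideal N-bit fixed-point number with F fractional bits.

```python
def lns_range(int_bits, frac_bits, base=2.0):
    """Return (smallest, largest, range) magnitudes for an LNS exponent
    with int_bits integer and frac_bits fractional weighted bits."""
    n = int_bits + frac_bits
    # Largest exponent magnitude: all N weighted bits set, scaled by 2^-F.
    x_max = (2 ** n - 1) * 2.0 ** (-frac_bits)
    largest = base ** x_max
    smallest = base ** (-x_max)
    return smallest, largest, largest / smallest


smallest, largest, span = lns_range(6, 9)
print(f"{smallest:.3e} <= |X| <= {largest:.3e}, range = {span:.3e}")
```

The printed values should agree with the bounds quoted above to the three significant digits given there.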
Observe that, as in floating-point systems, there is a "dead
zone" around zero, which can be closed by increasing the number
of fractional bits of the LNS exponent.

3.1.2  LNS  Base  Optimization 
Theorem 

If a) the dynamic range to be covered by the logarithmic number
system is V, b) the LNS wordlength (in bits) is N, and c) for a
random variable x representing the LNS exponents, a model for the
maximum error budget is described by the function
$E(r) = r^{x}\left(r^{e(r)} - 1\right)$ with $e(r) = 2^{-F(r)}$,
where F(r) is the number of fractional bits required for the
specific case of base selection, then there is no real base
minimizing the error function.


Proof 


Suppose that for a certain base r, the number of integer bits
available is I', whereas it is I for base 2. Then the dynamic
range is expressed as $V = 2^{2^{I}} = r^{2^{I'}}$. Then

$$2^{I} = 2^{I'} \lg r \;\Rightarrow\; I = I' + \lg(\lg r) \;\Rightarrow\; I' = I - \lg(\lg r) \;\Rightarrow\; F(r) = N - I' = N - I + \lg(\lg r)$$


To minimize E(r), the derivative is evaluated and found to be
proportional to a factor G(r) for all r.

For G(r) to be zero, $2^{(-N + I - \lg(\lg r))} = 0 \;\Rightarrow\; -N + I - \lg(\lg r) = -\infty \;\Rightarrow\; r = \infty$.

In other words, there is no real base r minimizing the considered
criterion of optimality. However, the factor
$\left|r^{e(r)} - 1\right|$ by which every real number $X = r^{x}$
is multiplied assumes the same value for all real bases r. Any
base, such as r = 2, r = e, or r = 10, can be used without
altering the error budget. Consequently, the field is open for the
application of any other optimization scheme imposed by hardware
or software requirements. The above analysis was verified
experimentally with some of the test values reported in Table 3.1,
where the various parameters have the following values:

$N = 12$, $V = 2^{16} = 2^{2^{4}}$, and $F(r) = 12 - 4 + \lg(\lg r)$


TABLE 3.1

Error Values for Several LNS Bases and
for the Same Dynamic Range

r          F(r)       r^e(r) - 1
16         10.0       2.711275 x 10^-3
4           9.0       2.711275 x 10^-3
2           8.0       2.711275 x 10^-3
sqrt(2)     7.0       2.711275 x 10^-3
e           8.5287    2.711275 x 10^-3
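
The invariance claimed by the theorem can be checked numerically. The sketch below is an illustration only (the function names are hypothetical); it evaluates $F(r) = N - I + \lg(\lg r)$ and the factor $r^{e(r)} - 1$ for the bases of Table 3.1, assuming N = 12 and I = 4 as in the text.

```python
import math


def fractional_bits(r, n_bits=12, int_bits_base2=4):
    """F(r) = N - I + lg(lg r), with I the integer bits needed in base 2."""
    return n_bits - int_bits_base2 + math.log2(math.log2(r))


def error_factor(r, n_bits=12, int_bits_base2=4):
    """The factor r^e(r) - 1 with e(r) = 2^-F(r)."""
    e = 2.0 ** (-fractional_bits(r, n_bits, int_bits_base2))
    return r ** e - 1.0


for base in (16.0, 4.0, 2.0, math.sqrt(2.0), math.e):
    print(f"r = {base:8.5f}  F(r) = {fractional_bits(base):7.4f}  "
          f"factor = {error_factor(base):.6e}")
```

Because $e(r)\,\lg r = 2^{I-N}$ for every base, the factor reduces to $2^{2^{I-N}} - 1$, which is why all rows of the table carry the same value.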

3.2 LNS Arithmetic Operations

Following the previously adopted number representation,
arithmetic in LNS is described below. To execute operations such
as addition and subtraction as fast as possible, table look-up
operations are employed to support them. In the following
discussion of operations the numbers are represented as

$A \to S_A r^{a}$, $B \to S_B r^{b}$, $C \to S_C r^{c}$, $\Sigma \to S_\Sigma r^{s}$, $\Delta \to S_\Delta r^{\delta}$,

$H \to S_H r^{h}$, $E \to S_E r^{e}$, $M \to S_M r^{\mu}$, $T \to S_T r^{\tau}$, $V \to S_V r^{v}$


3.2.1  Multiplication  and  Division 

For $C = A \times B \;\Rightarrow\; c \leftarrow a + b$ and $S_C = S_A \oplus S_B$

(3.2)

For $C = A \div B \;\Rightarrow\; c \leftarrow a - b$ and $S_C = S_A \oplus S_B$

where $\oplus$ denotes an "exclusive or" operation. Multiplication
and division in the conventional systems are thus replaced by
addition and subtraction in LNS, resulting in very fast
operations. Overflows and underflows can be detected very easily
by simple comparisons with the largest and the smallest numbers
representable by the system. The flowchart of the LNS
multiplication/division is shown in Figure 3.1.
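
As an illustration of equations (3.2), the following Python sketch models an LNS number as a sign bit and a signed exponent. The class `LNS`, the helper names and the choice of base 2 are assumptions made for the example; exponents are kept as floating-point values, so finite wordlength, overflow and underflow are not modeled.

```python
import math
from typing import NamedTuple

BASE = 2.0  # assumed base r


class LNS(NamedTuple):
    sign: int    # overall sign S_X: 0 for positive, 1 for negative
    exp: float   # signed exponent x, so that |X| = BASE ** exp


def encode(value: float) -> LNS:
    if value == 0.0:
        raise ValueError("zero has no finite logarithm")
    return LNS(1 if value < 0 else 0, math.log(abs(value), BASE))


def decode(n: LNS) -> float:
    magnitude = BASE ** n.exp
    return -magnitude if n.sign else magnitude


def lns_mul(a: LNS, b: LNS) -> LNS:
    # c <- a + b, S_C = S_A xor S_B
    return LNS(a.sign ^ b.sign, a.exp + b.exp)


def lns_div(a: LNS, b: LNS) -> LNS:
    # c <- a - b, S_C = S_A xor S_B
    return LNS(a.sign ^ b.sign, a.exp - b.exp)


print(decode(lns_mul(encode(-6.0), encode(0.5))))
print(decode(lns_div(encode(-6.0), encode(-3.0))))
```

The two products cost one exponent addition or subtraction and one XOR each, which is the point made in the text.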

3.2.2  Addition  and  Subtraction 

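Without loss of generality one can assume that A > B. Then

for $\Sigma = A + B \;\Rightarrow\; s \leftarrow a + \Phi(b - a)$ and $S_\Sigma = S_A$, with $\Phi(b - a) = \mathrm{lr}\left(1 + r^{\,b-a}\right)$

for $\Delta = A - B \;\Rightarrow\; \delta \leftarrow a + \Psi(b - a)$ and $S_\Delta = S_A$, with $\Psi(b - a) = \mathrm{lr}\left(1 - r^{\,b-a}\right)$

where lr denotes the logarithm to the base r. For the special case
when A = B, then $\delta \leftarrow 0$ and $S_\Delta = 0$.

The flowchart for LNS addition/subtraction is offered in
Figure 3.2.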
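
A minimal Python sketch of these two rules is given below. It assumes positive operands and base r = 2, and it evaluates the mappings with floating-point logarithms instead of memory tables; the function names are hypothetical.

```python
import math

BASE = 2.0  # assumed base r


def lr(value: float) -> float:
    """Logarithm to the base r."""
    return math.log(value, BASE)


def phi(d: float) -> float:
    """Addition mapping lr(1 + r^d) for d = b - a <= 0."""
    return lr(1.0 + BASE ** d)


def psi(d: float) -> float:
    """Subtraction mapping lr(1 - r^d) for d = b - a < 0."""
    return lr(1.0 - BASE ** d)


def lns_add(a: float, b: float) -> float:
    """Exponent of r^a + r^b; operands are swapped so that a >= b."""
    a, b = max(a, b), min(a, b)
    return a + phi(b - a)


def lns_sub(a: float, b: float) -> float:
    """Exponent of r^a - r^b, defined here only for a > b."""
    if a <= b:
        raise ValueError("this sketch requires a > b")
    return a + psi(b - a)


print(BASE ** lns_add(lr(5.0), lr(3.0)))
print(BASE ** lns_sub(lr(5.0), lr(3.0)))
```

In a hardware realization the calls to `phi` and `psi` would be replaced by look-ups addressed by the difference of the two exponents.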
FIGURE 3.1

Flowchart for LNS Multiplication/Division

FIGURE 3.2

Flowchart for LNS Addition/Subtraction
Historically, addition and subtraction have been the principal
obstacles to developing versatile LNS engines. Both operations
involve logarithmic mappings as well as regular additions. The
values of $\Phi(v)$ and $\Psi(v)$, where $v = |a - b|$, can be
obtained as memory table look-ups. Given the limitations of
contemporary high-speed memory chips (see Table 3.2), the
wordlength of v can be estimated to be on the order of 12 bits.
This has been the historical limit to the precision of LNS.
Nevertheless, LNS has been shown to be potentially powerful in
the class of digital filtering problems in which speed, the
nature of the signals and component count outweigh accuracy
considerations. For example, real-time digital filtering of radar
video for moving target detection, synthetic aperture processing,
and pulse compression are in this class. Because of the
statistical nature of the sampled signals, the large amount of
signal integration required and the characterization of detection
performance on a probabilistic basis, the accuracy of a single
computation has less importance than the mean and variance of the
operation on the signal ensemble. Radar video is characterized by
broad bandwidth and correspondingly high data rates, which make
real-time multiplication with readily available logic very
difficult. Furthermore, multiple filters are usually required
because the noise is colored and the filter bandpass is but a
small fraction of
the actual signal bandwidth. Examples like the one just given
show that even precision-"handicapped" LNS engines are valuable
in certain cases.

TABLE 3.2

Commercially Available High-Speed Memory Chips

         Size       Access Time (ns)   Technology Family   Pins
PROMS    1K x 4     35                 TTL                 18
         2K x 4     35                 TTL                 18
         1K x 8     30                 TTL                 24
         2K x 8     35                 TTL                 24
         4K x 4     35                 TTL                 20
         4K x 8     40                 TTL                 24
         8K x 8     40                 TTL                 24
SRAMS    1K x 4      8                 ECL                 24
                    30                 TTL                 16
         4K x 1     15                 ECL                 18
                    10                 ECL                 20
                    55                 CMOS                18
         2K x 8     45                 CMOS                20
         4K x 4     15                 ECL                 28
                    55                 CMOS                20
         8K x 8     55                 CMOS                20
         16K x 1    15                 ECL                 20
                    35                 CMOS                20
         16K x 4    35                 ECL                 20
         64K x 1    55                 CMOS                22
                    40                 NMOS                22
         64K x 8    55                 TTL                 22

Data are taken from the DATABOOKS 84/85 of:
AMD, FAIRCHILD, NEC, HITACHI, MOTOROLA

PLAs have also been used in place of ROMs and RAMs by some
designers. An 8-bit, single-chip, 3 μm CMOS design that makes
extensive use of PLAs has been reported [Lan85].

3.2.3  Other  Arithmetic  Operations 


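Some more operations include

Squaring:        For $T = A^{2} \;\Rightarrow\; \tau \leftarrow 2a$, with sign $S_T$

Square rooting:  For $C = A^{0.5} \;\Rightarrow\; c \leftarrow a/2$, with sign $S_C$

Exponentiating:  For $E = A^{B} \;\Rightarrow\; e \leftarrow a\,r^{b}$ and

$$S_E = \begin{cases} S_A & \text{for } A > 0 \text{ or } (A < 0 \text{ and } B \text{ odd}) \\ -S_A & \text{for } A < 0 \text{ and } B \text{ even} \end{cases} \qquad (3.4)$$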


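The exponents of some hyperbolic trigonometric functions are
given as

Hyp. cosine:     For $H = \cosh X \;\Rightarrow\; h \leftarrow z + \Phi(2z) - \mathrm{lr}\,2$

Hyp. sine:       For $M = \sinh X \;\Rightarrow\; \mu \leftarrow z + \Psi(2z) - \mathrm{lr}\,2$

Hyp. tangent:    For $T = \tanh X \;\Rightarrow\; \tau \leftarrow \Psi(2z) - \Phi(2z)$

Hyp. cotangent:  For $C = \coth X \;\Rightarrow\; c \leftarrow \Phi(2z) - \Psi(2z)$

Hyp. secant:     For $A = \operatorname{sech} X \;\Rightarrow\; a \leftarrow -z - \Phi(2z) + \mathrm{lr}\,2$

Hyp. cosecant:   For $V = \operatorname{csch} X \;\Rightarrow\; v \leftarrow -z - \Psi(2z) + \mathrm{lr}\,2$

with $z = X\,\mathrm{lr}(e) = \mathrm{lr}(e^{X})$.   (3.5)

Here the tables are addressed by the nonnegative quantity 2z, that
is, $\Phi(2z) = \mathrm{lr}(1 + r^{-2z})$ and
$\Psi(2z) = \mathrm{lr}(1 - r^{-2z})$.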
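
The relationships above can be checked against a standard math library. The sketch below is illustrative only: it assumes base r = 2 and X > 0, takes the mappings as $\Phi(v) = \mathrm{lr}(1 + r^{-v})$ and $\Psi(v) = \mathrm{lr}(1 - r^{-v})$ addressed by v = 2z, and uses hypothetical function names.

```python
import math

BASE = 2.0  # assumed base r


def lr(value: float) -> float:
    return math.log(value, BASE)


def phi(v: float) -> float:
    return lr(1.0 + BASE ** (-v))


def psi(v: float) -> float:
    return lr(1.0 - BASE ** (-v))


def hyperbolic_exponents(x: float) -> dict:
    """LNS exponents of the six hyperbolic functions of x > 0."""
    z = x * lr(math.e)           # z = lr(e^x)
    p, q = phi(2 * z), psi(2 * z)
    return {
        "cosh": z + p - lr(2.0),
        "sinh": z + q - lr(2.0),
        "tanh": q - p,
        "coth": p - q,
        "sech": -z - p + lr(2.0),
        "csch": -z - q + lr(2.0),
    }


exps = hyperbolic_exponents(0.7)
print(BASE ** exps["cosh"], math.cosh(0.7))
print(BASE ** exps["tanh"], math.tanh(0.7))
```

Each function needs one generation of z, one shift to form 2z, and at most two look-ups, in line with the architecture discussed in this section.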

For a derivation of the above relationships see Appendix A. They
bring to light some of the advantages of LNS. For example,
squaring and square rooting require no more than a shift of the
LNS exponent to the left or to the right, respectively. A basic
LNS architecture, based on the above equations, is shown in
Figure 3.3. The computation of regular and hyperbolic
trigonometric functions proves to be extremely simple and imposes
no extra hardware burden beyond one extra table for the
generation of z (defined in equations (3.5)). After z is
generated, it is shifted by 1 bit and presented to the tables
$\Phi$ and $\Psi$, according to the function that has to be
implemented. For a stand-alone or integrated hyperbolic
trigonometric LNS processor, the complex trigonometric functions
would not require more time than an LNS addition. In Figure 3.4
the architecture of an LNS hyperbolic trigonometric processor is
given. This design can easily be integrated with the basic LNS
architecture offered in Figure 3.3. A simple LNS alternative
architecture for CORDIC replacement is also described in Chapter
Nine (see Figure 9.2). These two architectures combined form the
basis for a complete LNS trigonometric processor. The regularity
of the data flow is remarkable and well suited to the VLSI design
of an LNS trigonometric processor.
FIGURE 3.3

Basic LNS Arithmetic Unit

FIGURE 3.4

LNS Hyperbolic Trigonometric Processor


CHAPTER  FOUR 

CONVERSION  TO  AND  FROM  THE  LNS 


In previous sections, a case was made to justify the LNS
processor as a special-purpose processor. In some cases, it can
be used directly after the acquisition of the signal, without any
number system conversion. In other cases the LNS processor is
used as a specialized arithmetic machine, such as a stand-alone
divide unit or a trigonometric processor (CORDIC replacement
unit). This will be discussed in later chapters. In a dedicated
LNS architecture, data need only be converted to and from LNS at
the input/output boundary. Within the LNS machine, data are
manipulated in an LNS format and do not incur a large conversion
overhead. A typical example of such a case is the hybrid
floating-point and logarithmic processor $(FU)^{2}$ [Tay85],
which is shown in Figure 4.1. Floating-point number
representation is the dominant choice of system designers when a
large dynamic range and high precision are simultaneously
required. However, there are two problems associated with this
kind of arithmetic. Besides being slow, it offers a hardware
utilization of as low as 50 percent. This is the result of two
different addition and multiplication paths.

FIGURE 4.1

Hybrid Floating-Point Logarithmic Processor

(Block diagram: forward code converters, LNS arithmetic section
with an addition path using a ROM and a multiplication path,
inverse code converter.)

The $(FU)^{2}$ consists of three architecturally distinct
sections: a) the FLP to LNS converter, b) the LNS arithmetic
unit, and c) the LNS to FLP converter. It has been shown to
possess a number of attractive attributes:

• No  alignment  is  required  for  addition  or  subtraction. 

• The  multiplier  is  a subset  of  the  adder. 

• The  data  flow  is  highly  regular,  because  all 
operations  occupy  the  same  interval  of  time, 
regardless  of  the  relative  values  of  the  operands. 

4.1 Conversion from FLP to LNS

In a floating-point environment, a real number X can be
approximated as

$$X = S_X\, m_x\, r^{x'}, \qquad \frac{1}{r} \le m_x < 1 \qquad (4.1)$$

where $S_X$ is the sign of the number, r is the base, $m_x$ is
the unsigned M-bit mantissa and x' is the (E+1)-bit signed
exponent. For the LNS representation $X = S_X r^{x}$ discussed in
Chapter Three, with an (N+2)-bit exponent and a memory table
$\Theta$ offering at its output the logarithm to the base r of
its input, x is given by

$$x = x' + \Delta x, \qquad \Delta x = \Theta(m_x) \qquad (4.2)$$

If r = 2, then $0.5 \le m_x < 1$ and $\Delta x < 0$.
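
For r = 2 the decomposition of equations (4.1) and (4.2) can be sketched in Python with `math.frexp`, which returns a mantissa in [0.5, 1) and an integer exponent. The function name `flp_to_lns` is hypothetical, and the table look-up is replaced here by an ideal, unquantized `math.log2`.

```python
import math


def flp_to_lns(value: float):
    """Return (sign, x_prime, delta_x) with |value| = 2 ** (x_prime + delta_x)."""
    if value == 0.0:
        raise ValueError("zero has no finite logarithm")
    sign = 1 if value < 0 else 0
    mantissa, x_prime = math.frexp(abs(value))   # 0.5 <= mantissa < 1
    delta_x = math.log2(mantissa)                # ideal table output, -1 <= delta_x < 0
    return sign, x_prime, delta_x


sign, x_prime, delta_x = flp_to_lns(10.0)
print(sign, x_prime, delta_x, x_prime + delta_x)
```

Since `delta_x` always lies in [-1, 0), the sum `x_prime + delta_x` can be formed by concatenation after adjusting the integer part, which is the simplification described next in the text.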

Several conversion schemes have been proposed and implemented.
The most recent one, offered by Lo and Aoki [Lo85], uses table
look-ups. It employs Difference Grouping Programmable Logic
Arrays (DGPLAs) to reduce the number of input combinational
product terms. The architecture of the FLP to LNS converter is
shown in Figure 4.2. For a base r = 2, the adder is not really
necessary, since $\Delta x$ will always range between -1 and 0.
Therefore, the memory table can be programmed to output directly
the fractional part of the 2's complement version of $\Delta x$,
which can then be concatenated to x' - 1 in order to produce x.
Of course, forming x' - 1 requires one add time, but this
operation may be postponed to the next stage of the $(FU)^{2}$.
There, if the operation to be performed is division, the
operation x' - 1 can be eliminated entirely, since division of
real numbers requires subtraction of the two LNS exponents
involved. Therefore, for a divide-only unit, the conversion time
from FLP to LNS is only one look-up, instead of the one look-up
plus one add required in general. An error analysis (based on the
assumption that the digital number presented to the DGPLA is the
original one) is also offered by Lo and Aoki [Lo85]. An error
analysis not restricted by this assumption is presented in the
next section. It results in an analytical expression for the
probability density function (p.d.f.) of the conversion error.
FIGURE 4.2

Architecture of an FLP to LNS Converter
4.2 Error Analysis

The only source of error in forming x is the one resulting from
the logarithmic mapping $\Delta x = \Theta(m_x)$, which is
performed as a memory table look-up operation. The proposed error
analysis model is illustrated in Figure 4.3. There, the two paths
associated with the forming of $\Delta x$ and of its finite
wordlength approximation are shown in parallel. Upon receiving
the FLP mantissa $m_x$, the lower path provides the ideal mapping
$\Delta x = \Theta(m_x)$, while the upper path consists of an
Input Quantizer (QI), which provides a discrete value of $m_x$
denoted $m_x^{*}$; then an ideal mapping of $m_x^{*}$ into
$L = \Theta(m_x^{*})$; and finally an Output Quantizer (QO),
which provides the machine version of L, namely
$\Delta x^{*} = \mathrm{RND}[L]$. The input and output
quantization errors are then defined as

$$E_I = m_x^{*} - m_x, \qquad E_O = \mathrm{RND}\left[\Theta(m_x^{*})\right] - \Theta(m_x^{*}) \qquad (4.3)$$

where $E_I$ and $E_O$ are uniform white noise sequences
possessing the following statistical properties:

$$E[m_x E_I] = E[\Theta(m_x^{*})\, E_O] = 0; \qquad E[E_{Ik} E_{Ij}] = \frac{q_I^{2}}{12}\,\delta_{kj}$$

$$E[E_I] = E[E_O] = 0; \qquad E[E_{Ok} E_{Oj}] = \frac{q_O^{2}}{12}\,\delta_{kj} \qquad (4.4)$$

Here, $q_I = 2^{-N_I}$ and $q_O = 2^{-N_O}$, where $N_I$ and
$N_O$ are the numbers of bits available at the input and output
of the logarithmic mapping L, and $\delta_{kj}$ is the Kronecker
delta. In addition we assume that
$m_x \in U\left[\tfrac{1}{2}, 1\right]$.

FIGURE 4.3

Error Model for FLP to LNS Conversion

The final error metric E is found by using the parameter
$D = E_I / m_x$ as

$$E = \Delta x^{*} - \Delta x = \Theta(m_x + E_I) + E_O - \Theta(m_x) = \mathrm{lr}(1 + D) + E_O \qquad (4.5)$$

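Through the application of the theory of functions of a random
variable [Pap65], the p.d.f. of D is found to be

$$f_D(D) = \int_{-\infty}^{\infty} |m_x|\, f_{E_I m_x}(D m_x, m_x)\, dm_x = \int_{-\infty}^{\infty} |m_x|\, f_{E_I}(D m_x)\, f_{m_x}(m_x)\, dm_x = 2\int_{1/2}^{1} m_x\, f_{E_I}(D m_x)\, dm_x \qquad (4.6)$$

or

$$f_D(D) = \begin{cases}
\dfrac{2}{q_I}\displaystyle\int_{1/2}^{-q_I/(2D)} m_x\, dm_x & \text{for } -q_I \le D \le -\dfrac{q_I}{2} \\[2ex]
\dfrac{2}{q_I}\displaystyle\int_{1/2}^{1} m_x\, dm_x & \text{for } -\dfrac{q_I}{2} \le D \le 0 \\[2ex]
\dfrac{2}{q_I}\displaystyle\int_{1/2}^{1} m_x\, dm_x & \text{for } 0 \le D \le \dfrac{q_I}{2} \\[2ex]
\dfrac{2}{q_I}\displaystyle\int_{1/2}^{q_I/(2D)} m_x\, dm_x & \text{for } \dfrac{q_I}{2} \le D \le q_I
\end{cases} \qquad (4.7)$$

or

$$f_D(D) = \begin{cases}
\dfrac{q_I}{4D^{2}} - \dfrac{1}{4q_I} & \text{for } -q_I \le D \le -\dfrac{q_I}{2} \\[2ex]
\dfrac{3}{4q_I} & \text{for } -\dfrac{q_I}{2} \le D \le \dfrac{q_I}{2} \\[2ex]
\dfrac{q_I}{4D^{2}} - \dfrac{1}{4q_I} & \text{for } \dfrac{q_I}{2} \le D \le q_I
\end{cases} \qquad (4.8)$$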

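For a better understanding of the algebra involved, the
following parameters are defined

$$J = \mathrm{lr}(1 - q_I), \quad K = \mathrm{lr}\left(1 - \frac{q_I}{2}\right), \quad M = \mathrm{lr}\left(1 + \frac{q_I}{2}\right), \quad N = \mathrm{lr}(1 + q_I) \qquad (4.9)$$

and $T^{\pm} = T \pm \dfrac{q_O}{2}$ for T = J, K, M, N and
$E^{\pm} = E \pm \dfrac{q_O}{2}$.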


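Proceeding, a new random variable P is defined as

$P = \mathrm{lr}(1 + D)$.

Then $D = r^{P} - 1$ and

$$f_P(P) = \frac{f_D(D)}{\left|\partial P / \partial D\right|} = (1 + D)\,\log r\; f_D(D) = r^{P}\,\log r\; f_D\!\left(r^{P} - 1\right) \qquad (4.10)$$

where log denotes the natural logarithm, or

$$f_P(P) = \begin{cases}
r^{P}\log r\left[\dfrac{q_I}{4\left(r^{P}-1\right)^{2}} - \dfrac{1}{4q_I}\right] & \text{for } J \le P \le K \\[2ex]
\dfrac{3}{4q_I}\, r^{P}\log r & \text{for } K \le P \le M \\[2ex]
r^{P}\log r\left[\dfrac{q_I}{4\left(r^{P}-1\right)^{2}} - \dfrac{1}{4q_I}\right] & \text{for } M \le P \le N
\end{cases} \qquad (4.11)$$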
Assuming that P and $E_O$ are statistically independent random
variables, one can find for the final error $E = P + E_O$ that

$$f_E(E) = \int_{-\infty}^{\infty} f_P(P)\, f_{E_O}(E - P)\, dP \qquad (4.12)$$

or

$$f_E(E) = \int_{J}^{K} A(P)\, f_{E_O}(E-P)\, dP + \int_{K}^{M} B(P)\, f_{E_O}(E-P)\, dP + \int_{M}^{N} A(P)\, f_{E_O}(E-P)\, dP \qquad (4.13)$$

with

$$A(P) = r^{P}\log r\left[\frac{q_I}{4\left(r^{P}-1\right)^{2}} - \frac{1}{4q_I}\right], \qquad B(P) = \frac{3}{4q_I}\, r^{P}\log r$$

$$\int_{a}^{b} A(P)\, dP = \frac{q_I}{4}\left[\frac{r^{b}-r^{a}}{\left(1-r^{b}\right)\left(1-r^{a}\right)}\right] + \frac{r^{a}-r^{b}}{4q_I}, \qquad \int_{a}^{b} B(P)\, dP = \frac{3\left(r^{b}-r^{a}\right)}{4q_I} \qquad (4.14)$$

where a, b define the domains for A(P) and B(P). Depending upon
the relative positions of a, b and the limits of the integrals
in Equation (4.13), one can distinguish between the following
cases:

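Case i) $q_O < q_I$.

Then by applying some straightforward algebra one finds

$$f_E(E) = \frac{1}{q_O}\begin{cases}
\displaystyle\int_{J}^{E^{+}} A(P)\, dP & \text{for } J^{-} \le E \le J^{+} \\[2ex]
\displaystyle\int_{E^{-}}^{E^{+}} A(P)\, dP & \text{for } J^{+} \le E \le K^{-} \\[2ex]
\displaystyle\int_{E^{-}}^{K} A(P)\, dP + \int_{K}^{E^{+}} B(P)\, dP & \text{for } K^{-} \le E \le K^{+} \\[2ex]
\displaystyle\int_{E^{-}}^{E^{+}} B(P)\, dP & \text{for } K^{+} \le E \le M^{-} \\[2ex]
\displaystyle\int_{M}^{E^{+}} A(P)\, dP + \int_{E^{-}}^{M} B(P)\, dP & \text{for } M^{-} \le E \le M^{+} \\[2ex]
\displaystyle\int_{E^{-}}^{E^{+}} A(P)\, dP & \text{for } M^{+} \le E \le N^{-} \\[2ex]
\displaystyle\int_{E^{-}}^{N} A(P)\, dP & \text{for } N^{-} \le E \le N^{+}
\end{cases} \qquad (4.15)$$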


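or, after substituting for the values of the integrals involved
from equation (4.14),

$$f_E(E) = \begin{cases}
\dfrac{1}{4q_O}\left[\dfrac{q_I}{1-r^{E^{+}}} + \dfrac{1-r^{E^{+}}}{q_I} - 2\right] & J^{-} \le E \le J^{+} \\[2ex]
\dfrac{1}{4q_O}\left[\dfrac{q_I}{1-r^{E^{+}}} - \dfrac{q_I}{1-r^{E^{-}}} - \dfrac{r^{E^{+}}-r^{E^{-}}}{q_I}\right] & J^{+} \le E \le K^{-}
\end{cases} \qquad (4.16a)$$

$$f_E(E) = \begin{cases}
\dfrac{1}{4q_O}\left[4 - \dfrac{q_I}{1-r^{E^{-}}} - \dfrac{1-r^{E^{-}}}{q_I} - \dfrac{3\left(1-r^{E^{+}}\right)}{q_I}\right] & K^{-} \le E \le K^{+} \\[2ex]
\dfrac{3}{4q_I q_O}\left(r^{E^{+}} - r^{E^{-}}\right) & K^{+} \le E \le M^{-}
\end{cases} \qquad (4.16b)$$

$$f_E(E) = \begin{cases}
\dfrac{1}{4q_O}\left[4 + \dfrac{q_I}{1-r^{E^{+}}} + \dfrac{1-r^{E^{+}}}{q_I} + \dfrac{3\left(1-r^{E^{-}}\right)}{q_I}\right] & M^{-} \le E \le M^{+} \\[2ex]
\dfrac{1}{4q_O}\left[\dfrac{q_I}{1-r^{E^{+}}} - \dfrac{q_I}{1-r^{E^{-}}} - \dfrac{r^{E^{+}}-r^{E^{-}}}{q_I}\right] & M^{+} \le E \le N^{-} \\[2ex]
\dfrac{1}{4q_O}\left[\dfrac{q_I}{r^{E^{-}}-1} + \dfrac{r^{E^{-}}-1}{q_I} - 2\right] & N^{-} \le E \le N^{+}
\end{cases} \qquad (4.16c)$$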

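This complicated piecewise continuous expression can be
approximated by using two distinct first-order polynomial
approximations. The simplified p.d.f. is then given by the curve

$$f_E(E) = \begin{cases}
\Lambda_1\, \dfrac{E - J^{-}}{K^{+} - J^{-}} & \text{for } J^{-} \le E \le K^{+} \\[2ex]
\dfrac{3}{4q_I q_O}\left(r^{E^{+}} - r^{E^{-}}\right) & \text{for } K^{+} \le E \le M^{-} \\[2ex]
\Lambda_2\, \dfrac{E - N^{+}}{M^{-} - N^{+}} & \text{for } M^{-} \le E \le N^{+}
\end{cases} \qquad (4.17)$$

with $\Lambda_1 = \dfrac{3}{4q_I q_O}\left[1 - \dfrac{q_I}{2}\right]\left[r^{q_O} - 1\right]$
and $\Lambda_2 = \dfrac{3}{4q_I q_O}\left[1 + \dfrac{q_I}{2}\right]\left[1 - r^{-q_O}\right]$.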
The precise p.d.f. of the final error is depicted by the solid
line in Figure 4.4. The approximating curve (generated by the
code found in Appendix F) is shown in Figure 4.4 by the dotted
line. Simulation was also used to verify the theoretical
results. The histogram of the error E was obtained first. Next,
the histogram values were divided by the total number of samples
and by the step of the histogram. The result was compared to the
precise theoretical curve obtained for the p.d.f. of the error.
The two curves coincided remarkably well. The dashed line in
Figure 4.4 stands for the experimental curve. The simulation was
run on a VAX 11-750 machine and the total number of samples was
taken to be 30,000 for all experiments. The p.d.f. curves for
$N_I = 7$ and $N_O = 8$ are shown in Figure 4.4. The program used
for the simulation of the logarithmic encoder is listed in
Appendix B, while the one used for the graphical depiction of
the theoretical curve and comparisons with the experimental one
is listed in Appendix C.

FIGURE 4.4

P.d.f. for the Error in FLP to LNS Conversion. Case i)
(Plot parameters: Q1 = 7, Q2 = 8, Ymax = 75, Xmax = 0.0131,
Samples = 30000.)
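
The simulation procedure described above can be sketched in Python as follows. This is not the program of Appendix B; it is a hypothetical reimplementation that assumes r = 2, a mantissa uniformly distributed in [0.5, 1], and rounding quantizers at both the input and the output of an ideal logarithm.

```python
import math
import random


def simulate_conversion_errors(n_in, n_out, samples, seed=0):
    """Errors of a quantized log2 look-up relative to the ideal mapping."""
    rng = random.Random(seed)
    q_in, q_out = 2.0 ** -n_in, 2.0 ** -n_out
    errors = []
    for _ in range(samples):
        m = rng.uniform(0.5, 1.0)
        m_star = round(m / q_in) * q_in                       # input quantizer QI
        machine = round(math.log2(m_star) / q_out) * q_out    # output quantizer QO
        errors.append(machine - math.log2(m))
    return errors


def histogram_density(errors, bins, lo, hi):
    """Histogram counts divided by the sample count and the bin width."""
    step = (hi - lo) / bins
    counts = [0] * bins
    for e in errors:
        k = min(int((e - lo) / step), bins - 1)
        counts[k] += 1
    return [c / (len(errors) * step) for c in counts]


errors = simulate_conversion_errors(7, 8, 30000)
lo = math.log2(1 - 2.0 ** -7) - 2.0 ** -9    # J minus q_O / 2
hi = math.log2(1 + 2.0 ** -7) + 2.0 ** -9    # N plus q_O / 2
density = histogram_density(errors, 40, lo, hi)
print(f"error support: [{lo:.6f}, {hi:.6f}], peak density: {max(density):.1f}")
```

The normalized histogram produced by `histogram_density` is the experimental curve that would be compared against the theoretical p.d.f.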


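Case ii) $q_O = q_I$.

Then, omitting the intermediate calculations, one finds for
$f_E(E)$

$$f_E(E) = \frac{1}{q_O}\begin{cases}
\displaystyle\int_{J}^{E^{+}} A(P)\, dP & \text{for } J^{-} \le E \le K^{-} \\[2ex]
\displaystyle\int_{J}^{K} A(P)\, dP + \int_{K}^{E^{+}} B(P)\, dP & \text{for } K^{-} \le E \le J^{+} \\[2ex]
\displaystyle\int_{E^{-}}^{K} A(P)\, dP + \int_{K}^{E^{+}} B(P)\, dP & \text{for } J^{+} \le E \le K^{+} \\[2ex]
\displaystyle\int_{E^{-}}^{E^{+}} B(P)\, dP & \text{for } K^{+} \le E \le M^{-} \\[2ex]
\displaystyle\int_{M}^{E^{+}} A(P)\, dP + \int_{E^{-}}^{M} B(P)\, dP & \text{for } M^{-} \le E \le N^{-} \\[2ex]
\displaystyle\int_{M}^{N} A(P)\, dP + \int_{E^{-}}^{M} B(P)\, dP & \text{for } N^{-} \le E \le M^{+} \\[2ex]
\displaystyle\int_{E^{-}}^{N} A(P)\, dP & \text{for } M^{+} \le E \le N^{+}
\end{cases} \qquad (4.18)$$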


Again, the above theoretical curve (generated by the code found
in Appendix D and represented by the solid line in Figure 4.5)
was tested against the experimental one, obtained for a sample
size of 20,000 for $N_I = N_O = 12$, and the two curves coincide.

FIGURE 4.5

P.d.f. for the Error in FLP to LNS Conversion. Case ii)
(Plot parameters: Input Bits = 12, Output Bits = 12,
Y_max = 2600, X_min = -0.000474, X_max = 0.000474,
Sample Size = 20000.)

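Case iii) $q_O = 2 q_I$ (that is, $N_I = 1 + N_O$).

Again, omitting the intermediate calculations one finds for
$f_E(E)$

$$f_E(E) = \frac{1}{q_O}\begin{cases}
\displaystyle\int_{J}^{E^{+}} A(P)\, dP & \text{for } J^{-} \le E \le K^{-} \\[2ex]
\displaystyle\int_{J}^{K} A(P)\, dP + \int_{K}^{E^{+}} B(P)\, dP & \text{for } K^{-} \le E \le J^{+} \\[2ex]
\displaystyle\int_{E^{-}}^{K} A(P)\, dP + \int_{K}^{E^{+}} B(P)\, dP & \text{for } J^{+} \le E \le M^{-} \\[2ex]
\displaystyle\int_{E^{-}}^{K} A(P)\, dP + \int_{K}^{M} B(P)\, dP + \int_{M}^{E^{+}} A(P)\, dP & \text{for } M^{-} \le E \le K^{+} \\[2ex]
\displaystyle\int_{M}^{E^{+}} A(P)\, dP + \int_{E^{-}}^{M} B(P)\, dP & \text{for } K^{+} \le E \le N^{-} \\[2ex]
\displaystyle\int_{M}^{N} A(P)\, dP + \int_{E^{-}}^{M} B(P)\, dP & \text{for } N^{-} \le E \le M^{+} \\[2ex]
\displaystyle\int_{E^{-}}^{N} A(P)\, dP & \text{for } M^{+} \le E \le N^{+}
\end{cases} \qquad (4.19)$$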


Again, the above theoretical curve (generated by the code found
in Appendix E and represented by the solid line in Figure 4.6)
was tested against the experimental one, obtained for a sample
size of 30,000 for $N_I = 8$ and $N_O = 7$, and the two curves
coincide. This last case is of particular interest, since the
precision offered by QI is higher than that of QO ($N_I > N_O$)
by at least 1 bit in the case of r = 2. This is a result of the
fact that the leading bit of the mantissa is always 1, and can
therefore be omitted from the address presented to the table.
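
All three cases follow from the same convolution of the p.d.f. of P with a uniform density of width $q_O$, so they can be evaluated with one expression: $f_E(E) = \left[F_D(r^{E^{+}} - 1) - F_D(r^{E^{-}} - 1)\right]/q_O$, where $F_D$ is the cumulative distribution obtained by integrating equation (4.8). The Python sketch below is an illustration under that observation, not the code of Appendices C to E; the function names are hypothetical and r = 2 is assumed by default.

```python
import math


def cdf_d(d, q_in):
    """Cumulative distribution of D, from integrating equation (4.8)."""
    if d <= -q_in:
        return 0.0
    if d >= q_in:
        return 1.0
    if d < -q_in / 2:
        return -q_in / (4 * d) - d / (4 * q_in) - 0.5
    if d <= q_in / 2:
        return 0.5 + 3 * d / (4 * q_in)
    return 1.5 - q_in / (4 * d) - d / (4 * q_in)


def pdf_e(e, n_in, n_out, r=2.0):
    """Theoretical p.d.f. of the conversion error E = P + E_O."""
    q_in, q_out = 2.0 ** -n_in, 2.0 ** -n_out
    upper = r ** (e + q_out / 2) - 1.0    # D value at P = E+
    lower = r ** (e - q_out / 2) - 1.0    # D value at P = E-
    return (cdf_d(upper, q_in) - cdf_d(lower, q_in)) / q_out


def integrate_pdf(n_in, n_out, steps=20000, r=2.0):
    """Midpoint-rule integral of pdf_e over its support [J-, N+]."""
    q_in, q_out = 2.0 ** -n_in, 2.0 ** -n_out
    lo = math.log(1 - q_in, r) - q_out / 2
    hi = math.log(1 + q_in, r) + q_out / 2
    h = (hi - lo) / steps
    return sum(pdf_e(lo + (k + 0.5) * h, n_in, n_out, r) for k in range(steps)) * h


for bits in ((7, 8), (12, 12), (8, 7)):
    print(bits, round(pdf_e(0.0, *bits), 1), round(integrate_pdf(*bits), 4))
```

The clamping of `cdf_d` outside $[-q_I, q_I]$ takes care of the limits J and N automatically, which is what the case-by-case integration limits in the equations above express.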

4.3  Conversion  from  LNS  to  FLP 


Again, software routines, such as those found in [Che72], can be
used to convert an LNS exponent to an FLP number. A hardware
realization would also require a table look-up operation to
assist in mapping x to $X = m_x r^{x'}$. More specifically

$$x' \leftarrow \lceil x \rceil; \qquad m_x \leftarrow \Omega(x' - x), \quad \text{where } \Omega(t) = r^{-t}$$

The LNS to FLP conversion architecture is shown in Figure 4.7. Of
course, $\lceil x \rceil - x$ is $1 - x_F$, where $x_F$ is the
fractional part of x. The computation of $1 - x_F$ is equivalent
to computing the 2's complement of $x_F$, or $1 + \bar{x}_F$. As
in the case of FLP to LNS conversion, for r = 2 the addition is
not necessary if the memory is programmed to output directly
$\Omega(\bar{x} + 1)$ rather than $\Omega(x)$. The ceiling
function, which is necessary to form $\lceil x \rceil$, is based
on an incrementer and is fast enough not to be critical to the
timing. The conversion procedure, then, takes one table look-up
time.

FIGURE 4.6

P.d.f. for the Error in FLP to LNS Conversion. Case iii)
(Plot parameters: Input Bits = 8, Output Bits = 7, Y_max = 140,
X_min = -0.009553, X_max = 0.009531.)

FIGURE 4.7

Architecture of an LNS to FLP Converter

Software interpolation is not recommended in this case, since
the implementation of the exponential function is slower than
that of the log function. Using Chen's algorithm, for example,
would require 12 add and 6 look-up times versus only 1 look-up
time (of a larger and consequently slower table, though) when
using a ROM.
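
The inverse conversion can be sketched in the same spirit. The Python fragment below is an illustration only: the function name `lns_to_flp` is hypothetical, the table look-up is replaced by an ideal exponential, and the mantissa it returns lies in (1/r, 1] because the exponent is formed with the ceiling function.

```python
import math


def lns_to_flp(sign: int, x: float, r: float = 2.0):
    """Return (sign, mantissa, x_prime) with |X| = mantissa * r ** x_prime."""
    x_prime = math.ceil(x)            # FLP exponent x' = ceil(x)
    mantissa = r ** (-(x_prime - x))  # ideal look-up addressed by x' - x
    return sign, mantissa, x_prime


sign, mantissa, x_prime = lns_to_flp(0, math.log2(10.0))
print(sign, mantissa, x_prime, mantissa * 2.0 ** x_prime)
```

Only the fractional part of x reaches the look-up, so the table size depends on F and not on the integer wordlength.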


CHAPTER  FIVE 

MEMORY  TABLE  REDUCTION  TECHNIQUES 
5.1 Essential Zeros

The LNS described in Chapter Three was shown to be memory table
look-up intensive when it comes to performing additions or
subtractions. There, the memory table address is defined as the
absolute difference of the LNS exponents of the two operands. For
x, y being the exponents to be logarithmically added and
$v = |x - y|$, the tables must yield the values of

$$\Phi(v) = \mathrm{lr}\left(1 + r^{-v}\right) \quad \text{and} \quad \Psi(v) = \mathrm{lr}\left(1 - r^{-v}\right)$$

By demanding that v is always positive, the values of $\Phi$ and
$\Psi$ are always less than one and therefore they require an
F-bit representation instead of an N-bit one (N = I + F). In
Figure 5.1 the curve $\Phi(v)$ versus v is shown, whereas Figure
5.2 shows the curve $\Psi(v)$ versus v, for various values of the
base r. It can be observed that a smaller value of r results in a
larger table.

The  (absolutely)  monotonically  decreasing  nature  of 
the  two  functions  indicates  that,  if  v exceeds  a certain 
value,  the  value  of  the  two  functions  will  be  less  than 
the  smallest  discrete  value  acceptable  by  the  system, 
determined  by  the  number  of  fractional  bits  available. 
FIGURE 5.1

Logarithmic Addition Mapping: Φ(v) versus v

FIGURE 5.2

Logarithmic Subtraction Mapping: Ψ(v) versus v
This observation was first made by Kingsbury and Rayner [Kin71].
This value will be referred to as the "essential zero", and
denoted $Z_A$ for addition and $Z_S$ for subtraction. Again, in
Figures 5.1 and 5.2 it can be observed that as r increases, the
values of $Z_A$ and $Z_S$ are reduced. By using a proof similar
to the one used for the base optimization in Chapter Three, for a
given r, these two variables can be proven to depend only on the
dynamic range one wants to cover and the LNS fractional
wordlength. They can be computed as follows:

$$\Phi(Z_A) = \mathrm{lr}\left(1 + r^{-Z_A}\right) < 2^{-(F+1)} \;\Rightarrow\; 1 + r^{-Z_A} < r^{\left(2^{-F-1}\right)} \;\Rightarrow\; r^{-Z_A} < r^{\left(2^{-F-1}\right)} - 1 \;\Rightarrow\; Z_A > -\mathrm{lr}\left(r^{\left(2^{-F-1}\right)} - 1\right) \qquad (5.1)$$

and

$$\Psi(Z_S) = \mathrm{lr}\left(1 - r^{-Z_S}\right) > -2^{-(F+1)} \;\Rightarrow\; 1 - r^{-Z_S} > r^{\left(-2^{-F-1}\right)} \;\Rightarrow\; r^{-Z_S} < 1 - r^{\left(-2^{-F-1}\right)} \;\Rightarrow\; Z_S > -\mathrm{lr}\left(1 - r^{\left(-2^{-F-1}\right)}\right) \qquad (5.2)$$

By using equations (5.1) and (5.2), some indicative values of
$Z_A$ and $Z_S$, with respect to certain values of the base r and
the fractional wordlength F, are tabulated in Table 5.1. The code
for generation of the entries of Table 5.1 is listed in Appendix
G. It can be observed that for values of F > 6 and r = 2, the
values of the essential zeros converge at about F + 1.53. So,
even though the dynamic range V extends, say, to $2^{128}$,
requiring 7 integer bits, a number of fractional bits limited to,
say, 12 would reduce the real need down to 4 integer bits, since
the essential zeros would only be about 13.528. This would
automatically reduce the storage requirements by a factor of 8.
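
The bounds of equations (5.1) and (5.2) are easy to evaluate directly. The Python sketch below is not the code of Appendix G; it is a hypothetical reimplementation that uses `math.expm1` to avoid cancellation when $r^{2^{-F-1}}$ is very close to one.

```python
import math


def essential_zeros(r: float, frac_bits: int):
    """Return (Z_A, Z_S) from the bounds of equations (5.1) and (5.2)."""
    ln_r = math.log(r)
    eps = 2.0 ** (-frac_bits - 1)
    # Z_A = -lr(r^eps - 1) and Z_S = -lr(1 - r^-eps), with lr(u) = ln(u) / ln(r)
    z_a = -math.log(math.expm1(eps * ln_r)) / ln_r
    z_s = -math.log(-math.expm1(-eps * ln_r)) / ln_r
    return z_a, z_s


for base, f in ((2.0, 8), (2.0, 12), (math.e, 12), (4.0, 12)):
    z_a, z_s = essential_zeros(base, f)
    print(f"r = {base:.4f}  F = {f:2d}  Z_A = {z_a:.6f}  Z_S = {z_s:.6f}")
```

For r = 2 both values approach $F + 1 - \lg(\ln 2) \approx F + 1.5288$ as F grows, which is the convergence noted in the text.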


5.2 Optimization of Base

Another kind of optimization can be achieved if the wordlength
and the dynamic range to be covered are given. The base can be
determined in such a way that the essential zero will be a power
of two. This leaves no unused space in the ROM, since for any
number F of inputs there will be no multiple values of zeros
stored in the table. For an essential zero of $V = 2^{v}$ and a
fractional wordlength of F bits, this value of optimal base can
be computed as follows:

$$\mathrm{lr}\left(1 + r^{-2^{v}}\right) < 2^{-F-1} \quad \text{or} \quad r^{\left(2^{-F-1}\right)} - r^{-2^{v}} > 1 \qquad (5.3)$$
TABLE 5.1

Essential Zeros for Various Values of r and F

r        F     Z_A           Z_S
2        5       6.520947      6.536572
2        6       7.524858      7.532671
2        7       8.526813      8.530719
2        8       9.527790      9.529743
2        9      10.528278     10.529255
2        10     11.528522     11.529011
2        11     12.528644     12.528888
2        12     13.528705     13.528827
2        13     14.528736     14.528797
e        12      9.010852      9.010974
e        8       6.237347      6.239301
4        12      6.264322      6.264444
1.293    12     40.354559     40.354681
1.088    12    136.158753    136.158875
By solving inequality (5.3) numerically, the values of the base
as a function of v and F were computed and are shown in Table
5.2. It can be seen that all of them are between one and two,
thus offering one more justification for choosing the base to be
greater than unity. A base less than unity has been chosen by
various researchers [Lan85] for data compression purposes.

5.3 Other Table Reduction Techniques

Frey and Taylor [Fre85] proposed an algorithmic method for
memory table reduction. It is based on the "staircase"
shape of the functions Φ and Ψ and results in at least two
bits of savings when applied in addition to the essential
zero technique. All of the above methods, plus the
precision-enhancing methods to be presented in the next
chapters, show a feasible way to the realization of a
logarithmic processor having a precision of more than 20
bits.
TABLE 5.2

Optimal Bases for Given F and V (= 2^v)

    F       v       r

    3       3       1.460
    5       3       1.674
    4       3       1.562
    4       4       1.293
    8       7       1.067
    12      7       1.088


CHAPTER  SIX 

ADAPTIVE  RADIX  PROCESSOR 


Many researchers have devoted their efforts to finding
precision-enhancing methods for LNS, because of the
restricted wordlength of the fast memory tables and in
addition to the aforementioned table reduction techniques.
One fruit of this effort is the practical hardware
interpolation method offered by Taylor [Tay83]. In the
following, two more methods are presented. One of them,
called the Adaptive Radix Processor (ARP), is a "random
memory access" method, while the other, hereafter called
the Associative Memory Processor (AMP), is a "sequential
access" one.


6.1 Adaptive Radix Processor Principle

The Adaptive Radix Processor partitions the table
look-up address space in order to minimize the finite
wordlength effects. It is premised on the fact that the
address v = |x - y|, generated for the addition or
subtraction memory look-ups, is likely to have some
number, i, of leading zeros. Such an address can be
shifted to the left by i bits in a shift register. It can
then be presented to the table(s) performing the mappings
required for logarithmic addition or subtraction,
respectively

$$\Phi_i(v_i) = \log_r\left(1 + r_i^{-v_i}\right) \quad\text{and}\quad \Psi_i(v_i) = \log_r\left(1 - r_i^{-v_i}\right)$$     (6.1)

with $v_i = 2^{i} v$ and $r_i = r^{2^{-i}}$.
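
The shift-and-rescale idea can be illustrated with a minimal Python sketch. It assumes a fixed-point address with I integer and F fractional bits held in an integer, and it evaluates the mapping directly instead of reading a finite-wordlength ROM, so no table truncation is modelled. The names `phi`, `arp_add_lookup` and `max_shift` are hypothetical.

```python
import math

def phi(v, r=2.0):
    """Unshifted addition mapping log_r(1 + r**-v)."""
    return math.log(1.0 + r ** (-v), r)

def arp_add_lookup(addr, I, F, r=2.0, max_shift=2):
    """Shift out up to max_shift leading zeros of the (I+F)-bit address and
    evaluate the radix-adapted mapping log_r(1 + r_i**-v_i) with
    v_i = 2**i * v and r_i = r**(2**-i)."""
    width = I + F
    leading_zeros = width - addr.bit_length()
    i = min(leading_zeros, max_shift)
    v_i = (addr << i) / 2.0 ** F      # shifted address read as a real number
    r_i = r ** (2.0 ** (-i))          # adapted radix
    return i, math.log(1.0 + r_i ** (-v_i), r)

if __name__ == "__main__":
    shift, value = arp_add_lookup(37, I=5, F=12)
    print(shift, value, phi(37 / 4096.0))
```

Because r_i^(-v_i) equals r^(-v) exactly, the adapted mapping returns the same value as the unshifted one; the benefit in hardware comes from the i extra address bits that now reach the table.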


The final result of the operation is then the same as if
the normal logarithmic addition or subtraction table had
been looked up, and it is ready to be used directly in any
further computations. The only difference is that i more
bits of information have now been "compressed" into the
table space, thus reducing the total entropy of the system.
The architecture of the ARP for implementing the
logarithmic add/subtract cycle is presented in Figure 6.1.
The processor owes its name to its ability to select the
radix adaptively at run time. Next, an error analysis that
aims to justify the new processor theoretically is
presented.

6.2  Error  Analysis  for  the  ARP 

Multiplication is error free in LNS. Therefore, an
error analysis for the ARP needs only to examine the
addition part of the processor.

The general error model used for the analysis is
shown in Figure 6.2. There, the real LNS exponents of the
two numbers to be added are denoted x and y. Their
machine versions are x* and y*, respectively. They are
related through the errors a1 and a2, which were examined
in Chapter Four and found to have trapezoidal probability
density functions. For the analysis presented here, they
are assumed to be independent random variables.

FIGURE 6.1

Add Cycle Implementing the ARP Policy

FIGURE 6.2

Error Model for Logarithmic Addition

In an ideal environment (not in the processor), v would be
presented to a table offering as output the homomorphism
$\Phi(v) = \log_r(1 + r^{-v})$, and the LNS exponent of the
result of the addition would be x + Φ(v). The machine
version of v (v*) is presented to a memory table as a
number having a precision of $2^{-F+I}$, where F is the
number of fractional bits used for the output of the FLP to
LNS conversion table $\Theta(m_x) = \log_r(m_x)$ and I is
the number of (integer only) bits used for the
representation of x' (as in the FLP representation
$X = m_x r^{x'}$). Here v* is a random variable having a
triangular p.d.f., as shown in Figure 6.3, where Z is the
already familiar "essential zero" discussed in Chapter
Five. The triangular shape of $f_{v^*}(v^*)$ is explained
by the fact that v is the absolute difference of two
uniformly distributed numbers [Pap65]. It suggests that v*
is likely to have some number, say L, of leading zeros,
thus supplying the basis for the ARP architecture.

To be able to check the effects of the shifts suggested by
the ARP, the quantizer Q_L is provided in the error model,
thus introducing an error

$$|E_L| \le \frac{q_L}{2} \quad\text{with}\quad q_L = 2^{-F+I-L}$$     (6.2)

Of course, the error model should also provide for a
quantizer Q_0 at the output of the "ideal" table Φ(v), to
take care of the finite precision effects introduced by
the table wordlength. This introduces the error

$$|E_0| \le \frac{q_0}{2} \quad\text{with}\quad q_0 = 2^{-F}$$     (6.3)

FIGURE 6.3

P.d.f. for v = |x - y|

Finally, the error E at the output of the LNS adder is
given as

$$E = \log_r\left(r^{x} + r^{y}\right) - \log_r\left(r^{x^*} + r^{y^* - E_L}\right) - E_0$$

or

$$E = \log_r\left(r^{x} + r^{y}\right) - \log_r\left(r^{x - a_1} + r^{y - a_2 - E_L}\right) - E_0$$

or

$$E = \log_r\left[r^{x}\left(1 + r^{y-x}\right)\right] - \log_r\left[r^{x}\left(r^{-a_1} + r^{y - x - a_2 - E_L}\right)\right] - E_0$$     (6.4)


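Without loss of generality, it can be assumed that x > y.
Then for v = |x - y| one finds

$$E = x + \log_r\left(1 + r^{-v}\right) - x - \log_r\left(r^{-a_1} + r^{-v - a_2 - E_L}\right) - E_0$$

or

$$E = \log_r\left(1 + r^{-v}\right) - \log_r\left(r^{-a_1} + r^{-v - a_2 - E_L}\right) - E_0$$     (6.5)
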


Examining equation (6.5), one can observe that the p.d.f.
for E cannot be directly computed, because of the presence
of the random variable v, whose joint density function with
any of the other random variables (a1, a2, E_L, E_0) is
unknown or difficult to obtain. To bypass this problem,
one can assume that v is fixed (in other words, consider it
to be a deterministic variable) and calculate analytical
expressions for the mean and variance of E. Next, one can
integrate over an admissible range for v, divide the
result by the length of the range and finally multiply it
by the probability that v belongs to that specific range.
If v_L and v_H are the low and high limits of a range for
v, then the probability that v belongs to that domain is
calculated to be

P[v_L, v_H]     (6.6)

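
For orientation, the range probability can be evaluated in closed form once a density for v is fixed. The following Python sketch is an illustration under a stated assumption, not a transcription of (6.6): it takes v to have the triangular density f(v) = 2(Z - v)/Z^2 on [0, Z], which is what the absolute difference of two independent numbers uniform on [0, Z] produces. The function name is hypothetical.

```python
def prob_in_range(v_low, v_high, Z):
    """P[v_low <= v <= v_high] for the assumed triangular density
    f(v) = 2 (Z - v) / Z**2 on [0, Z]."""
    if not 0.0 <= v_low <= v_high <= Z:
        raise ValueError("need 0 <= v_low <= v_high <= Z")
    # integral of 2 (Z - v) / Z**2 from v_low to v_high
    return (2.0 * Z * (v_high - v_low) - (v_high ** 2 - v_low ** 2)) / Z ** 2

if __name__ == "__main__":
    # share of addresses falling in the lower half of the range
    print(prob_in_range(0.0, 4.0, 8.0))
```

Under this assumption three quarters of all addresses fall in the lower half of [0, Z], which is the quantitative content of the remark that v is likely to have leading zeros.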


Since E_L is always bounded by q_L/2 and a1, a2 are
bounded by $\log_r(1 + q) + q_1/2$, with $q = 2^{-F}$ and
$q_1 = q/2$, as proved in Chapter Four, one can find a
condition under which E_L will in general be greater than
a1 and a2. This condition is

$$2^{-F+I-L-1} > \log_r\left(1 + 2^{-F}\right) + 2^{-F-2}$$

or

$$L < I - F - 1 - \log_2\left[\log_r\left(1 + 2^{-F}\right) + 2^{-F-2}\right]$$     (6.7)

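
Condition (6.7) is easy to evaluate numerically. The short Python sketch below does so for a given I, F and base r; the function name is hypothetical and the base defaults to r = 2 as an assumption.

```python
import math

def max_shift_bound(I, F, r=2.0):
    """Right-hand side of condition (6.7): the shift L must stay below it."""
    encoding_bound = math.log(1.0 + 2.0 ** (-F), r) + 2.0 ** (-F - 2)
    return I - F - 1 - math.log2(encoding_bound)

if __name__ == "__main__":
    for I in (5, 8, 10):
        print(I, max_shift_bound(I, 12))
```

For r = 2 and F = 12 the bound evaluates to roughly I - 1.76.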


Operating at the limits of the available technology, that
is F = 12, condition (6.7) reduces to L < I - 2. Based on
this assumption, equation (6.5) can be rewritten as

$$E = \log_r\left(1 + r^{-v}\right) - \log_r\left(1 + r^{-v - E_L}\right) - E_0$$     (6.8)



The necessity of taking into account the errors
resulting from the encoding from FLP to LNS becomes
apparent from the limit values for E in Table 6.1.
Several values for E are shown there for different values
of v, I, and L. In all cases, the number of fractional
bits is F = 12 and the base is r = 2. The variable E_com
stands for the comprehensive error after the encoding
effects are taken into account, whereas E stands for the
error without any encoding effects being considered.
Several claims can be validated by observing this table.
Firstly, the larger the shift, the smaller the error.
Secondly, for a fixed value of v and a constant difference
between I and L, the error remains the same. Thirdly, a
small value of I means a small error too. Finally, in all
cases, E is less than E_com.
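
The limits of E without encoding effects follow directly from (6.8) by pushing E_L and E_0 to the ends of their ranges. The Python sketch below is a minimal illustration under that reading; the function name and argument order are hypothetical, and F = 12, r = 2 are used as defaults to match the table.

```python
import math

def error_limits(v, I, L, F=12, r=2.0):
    """Extreme values of E in (6.8) for |E_L| <= q_L/2 and |E_0| <= q_0/2."""
    q0 = 2.0 ** (-F)
    qL = 2.0 ** (-F + I - L)
    phi = lambda u: math.log(1.0 + r ** (-u), r)
    K = phi(v)
    low = K - phi(v - qL / 2.0) - q0 / 2.0    # E_L = -q_L/2, E_0 = +q_0/2
    high = K - phi(v + qL / 2.0) + q0 / 2.0   # E_L = +q_L/2, E_0 = -q_0/2
    return low, high

if __name__ == "__main__":
    for shift in (4, 3, 2, 1, 0):
        low, high = error_limits(0.5, 5, shift)
        print(shift, round(-low * 1e6), round(high * 1e6))
```

Since q_L depends only on I - L, the sketch also reproduces the second observation above: equal differences I - L give equal limits.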

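The analysis can now proceed, since the random variable P,
where $P = \log_r(1 + r^{-v - E_L})$, can be analyzed as a
function of another random variable (E_L). Solving for E_L

$$E_L = -v - \log_r\left(r^{P} - 1\right)$$     (6.9)

and

$$\frac{\partial P}{\partial E_L} = \frac{1}{\ln r}\,\frac{1}{1 + r^{-v-E_L}}\,\frac{\partial\left(1 + r^{-v-E_L}\right)}{\partial E_L} = \frac{1}{\ln r}\,\frac{1}{1 + r^{-v-E_L}}\,(-1)\,(\ln r)\, r^{-v-E_L} = -\frac{r^{-v-E_L}}{1 + r^{-v-E_L}}$$     (6.10)

and

$$\left.\frac{\partial P}{\partial E_L}\right|_{E_L} = -\frac{r^{P} - 1}{1 + r^{P} - 1} = -\frac{r^{P} - 1}{r^{P}}$$     (6.11)
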
TABLE 6.1

Limits of Error at the Output of the Logarithmic Adder.

                        E x 10^-6          E_com x 10^-6
    v      I     L     -low     high      -low     high

    .5     5     4      223      223       637      636
    .5     5     3      324      324       738      738
    .5     5     2      527      526       940      940
    .5     5     1      931      931      1345     1345
    .5     5     0     1741     1739      2155     2152
    .5     8     7      223      223       637      636
    .5     8     5      527      526       940      940
    .5     8     3     1741     1739      2155     2152
    .5    10     3     6615     6574      7028     6987
    .5    10     2    13148    12984     13562    13977
    .1     5     0     2009     2006      2422     2419
    .1    10     7      594      593      1007     1007
    .1    10     3     7685     7643      8098     8056
    .01    4    16      244      244       657      657
    .01    3    16      365      365       779      779
    .01    0    16     2070     2067      2483     2480

E_com is with and E is without any errors, resulting from
encoding from other arithmetic systems, being considered.
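so that

$$f_P(P) = \left.\frac{f_{E_L}(E_L)}{\left|\partial P / \partial E_L\right|}\right|_{E_L} = \frac{r^{P}}{q_L\left(r^{P} - 1\right)} \quad\text{for } L_1 < P < L_2$$     (6.12)

with

$$L_1 = \log_r\left(1 + r^{-v - q_L/2}\right)$$     (6.13)

and

$$L_2 = \log_r\left(1 + r^{-v + q_L/2}\right)$$     (6.14)
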
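
A quick numerical check confirms that the density (6.12) is properly normalized on the interval given by (6.13) and (6.14). The Python sketch below integrates it both in closed form, using the antiderivative log_r(r^P - 1)/q_L, and with a midpoint rule; the function names and the sample values v = 0.5, q_L = 2^-10 are illustrative assumptions.

```python
import math

def p_limits(v, qL, r=2.0):
    """Limits L1, L2 of (6.13) and (6.14)."""
    L1 = math.log(1.0 + r ** (-v - qL / 2.0), r)
    L2 = math.log(1.0 + r ** (-v + qL / 2.0), r)
    return L1, L2

def f_P(P, qL, r=2.0):
    """Density (6.12), valid for L1 < P < L2."""
    return r ** P / (qL * (r ** P - 1.0))

def cdf_P(P, v, qL, r=2.0):
    """Integral of f_P from L1 to P, using log_r(r**L1 - 1) = -v - qL/2."""
    return (math.log(r ** P - 1.0, r) + v + qL / 2.0) / qL

def midpoint_integral(v, qL, r=2.0, steps=1000):
    """Midpoint-rule integral of f_P over [L1, L2]."""
    L1, L2 = p_limits(v, qL, r)
    h = (L2 - L1) / steps
    return sum(f_P(L1 + (k + 0.5) * h, qL, r) for k in range(steps)) * h

if __name__ == "__main__":
    v, qL = 0.5, 2.0 ** -10
    L1, L2 = p_limits(v, qL)
    print(cdf_P(L2, v, qL), midpoint_integral(v, qL))
```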


If $M = \log_r(1 + r^{-v - E_L}) + E_0$, or $M = P + E_0$,
then

$$f_M(M) = \int_{-\infty}^{\infty} f_P(P)\, f_{E_0}(M - P)\, dP = \frac{1}{q_0}\int_{M - q_0/2}^{M + q_0/2} f_P(P)\, dP$$     (6.15)

To compute the value of the above integral, precise
knowledge of the relative positions of M and the limits of
the domain of the variable P must be at hand. To
facilitate the understanding of the algebra involved,
unnecessary complexity can be avoided by defining

$$A = \frac{1}{q_0 q_L}, \qquad M^{\pm} = M \pm \frac{q_0}{2}, \qquad K^{\pm} = \log_r\left(r^{M^{\pm}} - 1\right), \qquad K = \log_r\left(1 + r^{-v}\right),$$

$$N^{\pm} = \log_r\left(r^{K - E \pm q_0/2} - 1\right) \quad\text{and}\quad L_i^{\pm} = L_i \pm \frac{q_0}{2}, \quad i = 1, 2$$     (6.16)

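Case i)

$$L_2^{-} > L_1^{+} \quad\text{or}\quad q_0 < L_2 - L_1$$     (6.17)

Then M is moving along the axis and $f_M(M)$ is found to be
given by

$$f_M(M) = \begin{cases} A \displaystyle\int_{L_1}^{M^{+}} \frac{r^{P}}{r^{P} - 1}\, dP & \text{for } L_1^{-} < M < L_1^{+} \\[2ex] A \displaystyle\int_{M^{-}}^{M^{+}} \frac{r^{P}}{r^{P} - 1}\, dP & \text{for } L_1^{+} < M < L_2^{-} \\[2ex] A \displaystyle\int_{M^{-}}^{L_2} \frac{r^{P}}{r^{P} - 1}\, dP & \text{for } L_2^{-} < M < L_2^{+} \end{cases}$$     (6.18)

or

$$f_M(M) = \begin{cases} A\left[\log_r\left(r^{M^{+}} - 1\right) - \log_r\left(r^{L_1} - 1\right)\right] & \text{for } L_1^{-} < M < L_1^{+} \\ A\left[\log_r\left(r^{M^{+}} - 1\right) - \log_r\left(r^{M^{-}} - 1\right)\right] & \text{for } L_1^{+} < M < L_2^{-} \\ A\left[\log_r\left(r^{L_2} - 1\right) - \log_r\left(r^{M^{-}} - 1\right)\right] & \text{for } L_2^{-} < M < L_2^{+} \end{cases}$$     (6.19)

or

$$f_M(M) = \begin{cases} A\left[K^{+} + v + \dfrac{q_L}{2}\right] & \text{for } L_1^{-} < M < L_1^{+} \\[1.5ex] A\left[K^{+} - K^{-}\right] & \text{for } L_1^{+} < M < L_2^{-} \\[1.5ex] A\left[-v + \dfrac{q_L}{2} - K^{-}\right] & \text{for } L_2^{-} < M < L_2^{+} \end{cases}$$     (6.20)
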


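Case ii)

$$L_2^{-} < L_1^{+} \quad\text{or}\quad q_0 > L_2 - L_1$$     (6.21)

Then M is moving along the axis and $f_M(M)$ is found to be
given by

$$f_M(M) = \begin{cases} A \displaystyle\int_{L_1}^{M^{+}} \frac{r^{P}}{r^{P} - 1}\, dP & \text{for } L_1^{-} < M < L_2^{-} \\[2ex] A \displaystyle\int_{L_1}^{L_2} \frac{r^{P}}{r^{P} - 1}\, dP & \text{for } L_2^{-} < M < L_1^{+} \\[2ex] A \displaystyle\int_{M^{-}}^{L_2} \frac{r^{P}}{r^{P} - 1}\, dP & \text{for } L_1^{+} < M < L_2^{+} \end{cases}$$     (6.22)

or

$$f_M(M) = \begin{cases} A\left[K^{+} + v + \dfrac{q_L}{2}\right] & \text{for } L_1^{-} < M < L_2^{-} \\[1.5ex] A\left[-v + \dfrac{q_L}{2} + v + \dfrac{q_L}{2}\right] = \dfrac{1}{q_0} & \text{for } L_2^{-} < M < L_1^{+} \\[1.5ex] A\left[-v + \dfrac{q_L}{2} - K^{-}\right] & \text{for } L_1^{+} < M < L_2^{+} \end{cases}$$     (6.23)
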


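Finally, the error at the output of the logarithmic adder
is given by

$$E = \log_r\left(1 + r^{-v}\right) - M = K - M$$     (6.24)

Solving for M, $M = K - E$ and
$\left|\partial E / \partial M\right| = |-1| = 1$. Then

$$f_E(E) = \left.\frac{f_M(M)}{\left|\partial E / \partial M\right|}\right|_{M} = f_M(K - E)$$     (6.25)
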
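Again, for the same two cases as for M, expressions for
$f_E(E)$ are computed as

Case i)

If $q_0 < L_2 - L_1$, then

$$f_E(E) = \begin{cases} A\left[N^{+} + v + \dfrac{q_L}{2}\right] & \text{for } K - L_1^{+} < E < K - L_1^{-} \\[1.5ex] A\left[N^{+} - N^{-}\right] & \text{for } K - L_2^{-} < E < K - L_1^{+} \\[1.5ex] A\left[-v + \dfrac{q_L}{2} - N^{-}\right] & \text{for } K - L_2^{+} < E < K - L_2^{-} \end{cases}$$     (6.26)

or

$$f_E(E) = \begin{cases} A\left[-v + \dfrac{q_L}{2} - N^{-}\right] & \text{for } K - L_2^{+} < E < K - L_2^{-} \\[1.5ex] A\left[N^{+} - N^{-}\right] & \text{for } K - L_2^{-} < E < K - L_1^{+} \\[1.5ex] A\left[N^{+} + v + \dfrac{q_L}{2}\right] & \text{for } K - L_1^{+} < E < K - L_1^{-} \end{cases}$$     (6.27)
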

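It can be observed that

$$\begin{aligned} a &= K - L_2^{+} &&\Rightarrow& f_E(a) &= 0 \\ b &= K - L_2^{-} &&\Rightarrow& f_E(b) &= A\left[\frac{q_L}{2} - v - \log_r\left(r^{L_2 - q_0} - 1\right)\right] \\ c &= K - L_1^{+} &&\Rightarrow& f_E(c) &= A\left[\frac{q_L}{2} + v + \log_r\left(r^{L_1 + q_0} - 1\right)\right] \\ d &= K - L_1^{-} &&\Rightarrow& f_E(d) &= 0 \end{aligned}$$     (6.28)
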


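Then, the complicated expression (6.27) can be very
accurately approximated by the trapezoidal shaped
function, shown in Figure 6.4 and expressed as

$$f_E(E) = \begin{cases} \mathrm{TOP}\,\dfrac{E - a}{b - a} & \text{for } a < E < b \\[1.5ex] \mathrm{TOP} & \text{for } b < E < c \\[1.5ex] \mathrm{TOP}\,\dfrac{E - d}{c - d} & \text{for } c < E < d \end{cases} \quad\text{or}\quad f_E(E) = \begin{cases} (E - a)\,\dfrac{\mathrm{TOP}}{q_0} & \text{for } a < E < b \\[1.5ex] \mathrm{TOP} & \text{for } b < E < c \\[1.5ex] (d - E)\,\dfrac{\mathrm{TOP}}{q_0} & \text{for } c < E < d \end{cases}$$     (6.29)

where TOP can be computed by requiring that
$\int_{-\infty}^{\infty} f_E(E)\, dE = 1$, as

$$\frac{\mathrm{TOP}}{q_0}\int_a^b (E - a)\, dE + \mathrm{TOP}\int_b^c dE - \frac{\mathrm{TOP}}{q_0}\int_c^d (E - d)\, dE = 1$$

or

$$\mathrm{TOP} = \frac{1}{L_2 - L_1}$$     (6.30)
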


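As explained earlier, the mean and variance of the error E
are necessary for the final justifications and they are
computed as

Mean and Variance of E for case i)

The mean of E is given by

$$M_1[E] = \int_{-\infty}^{\infty} f_E(E)\, E\, dE = \frac{\mathrm{TOP}}{q_0}\int_a^b (E - a) E\, dE + \mathrm{TOP}\int_b^c E\, dE - \frac{\mathrm{TOP}}{q_0}\int_c^d (E - d) E\, dE$$

$$= \frac{\mathrm{TOP}}{q_0}\left[\frac{b^3}{3} - \frac{a b^2}{2} + \frac{a^3}{6} + \frac{d^3}{6} + \frac{c^3}{3} - \frac{d c^2}{2}\right] + \mathrm{TOP}\left[\frac{c^2}{2} - \frac{b^2}{2}\right]$$     (6.31)

and the variance of E is

$$D_1[E] = \int_{-\infty}^{\infty} f_E(E)\, E^2\, dE - M_1^2[E] = \frac{\mathrm{TOP}}{q_0}\int_a^b (E - a) E^2\, dE + \mathrm{TOP}\int_b^c E^2\, dE - \frac{\mathrm{TOP}}{q_0}\int_c^d (E - d) E^2\, dE - M_1^2[E]$$

$$= \frac{\mathrm{TOP}}{q_0}\left[\frac{b^4}{4} - \frac{a b^3}{3} + \frac{a^4}{12} + \frac{d^4}{12} + \frac{c^4}{4} - \frac{d c^3}{3}\right] + \mathrm{TOP}\left[\frac{c^3}{3} - \frac{b^3}{3}\right] - M_1^2[E]$$     (6.32)

FIGURE 6.4

Shape of f(E) for Case i) of LNS Addition

Case ii)

If $q_0 > L_2 - L_1$, then

$$f_E(E) = \begin{cases} A\left[-v + \dfrac{q_L}{2} - N^{-}\right] & \text{for } K - L_2^{+} < E < K - L_1^{+} \\[1.5ex] \dfrac{1}{q_0} & \text{for } K - L_1^{+} < E < K - L_2^{-} \\[1.5ex] A\left[N^{+} + v + \dfrac{q_L}{2}\right] & \text{for } K - L_2^{-} < E < K - L_1^{-} \end{cases}$$     (6.33)

It can be observed that

$$\begin{aligned} i &= K - L_2^{+} &&\Rightarrow& f_E(i) &= 0 \\ j &= K - L_1^{+} &&\Rightarrow& f_E(j) &= \frac{1}{q_0} \\ k &= K - L_2^{-} &&\Rightarrow& f_E(k) &= \frac{1}{q_0} \\ l &= K - L_1^{-} &&\Rightarrow& f_E(l) &= 0 \end{aligned}$$     (6.34)

Then, the complicated expression (6.33) can be very
accurately approximated by the trapezoidal-like function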
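shown in Figure 6.5 and expressed as

$$f_E(E) = \begin{cases} \dfrac{1}{q_0}\,\dfrac{E - i}{j - i} & \text{for } i < E < j \\[1.5ex] \dfrac{1}{q_0} & \text{for } j < E < k \\[1.5ex] \dfrac{1}{q_0}\,\dfrac{E - l}{k - l} & \text{for } k < E < l \end{cases}$$     (6.35)

or

$$f_E(E) = \begin{cases} (E - i)\,\dfrac{\mathrm{TOP}}{q_0} & \text{for } i < E < j \\[1.5ex] \dfrac{1}{q_0} & \text{for } j < E < k \\[1.5ex] (l - E)\,\dfrac{\mathrm{TOP}}{q_0} & \text{for } k < E < l \end{cases}$$     (6.36)

where TOP has the same value as in case i.

FIGURE 6.5

Shape of f(E) for Case ii) of LNS Addition

Mean and Variance of E for case ii)

The mean of E is given by

$$M_2[E] = \int_{-\infty}^{\infty} f_E(E)\, E\, dE = \frac{\mathrm{TOP}}{q_0}\int_i^j (E - i) E\, dE + \frac{1}{q_0}\int_j^k E\, dE - \frac{\mathrm{TOP}}{q_0}\int_k^l (E - l) E\, dE$$

$$= \frac{\mathrm{TOP}}{q_0}\left[\frac{j^3}{3} - \frac{i j^2}{2} + \frac{i^3}{6} + \frac{l^3}{6} + \frac{k^3}{3} - \frac{l k^2}{2}\right] + \frac{1}{q_0}\left[\frac{k^2}{2} - \frac{j^2}{2}\right]$$     (6.37)

and the variance of E is

$$D_2[E] = \int_{-\infty}^{\infty} f_E(E)\, E^2\, dE - M_2^2[E] = \frac{\mathrm{TOP}}{q_0}\int_i^j (E - i) E^2\, dE + \frac{1}{q_0}\int_j^k E^2\, dE - \frac{\mathrm{TOP}}{q_0}\int_k^l (E - l) E^2\, dE - M_2^2[E]$$     (6.38)
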
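
The moments of the trapezoidal approximations can be cross-checked numerically. The Python sketch below rests on two observations that are not stated in the text and should be read as assumptions of this illustration: sorting the four points K - L_1 ± q_0/2 and K - L_2 ± q_0/2 yields (a, b, c, d) in case i and (i, j, k, l) in case ii, and in both cases the trapezoid equals the density of the sum of two independent uniform variables of widths L_2 - L_1 and q_0, so its variance should be ((L_2 - L_1)^2 + q_0^2)/12. Function names are hypothetical.

```python
import math

def error_breakpoints(v, r=2.0, F=12, I=5, L=3):
    """Sorted trapezoid breakpoints, q0, and the width w = L2 - L1."""
    q0 = 2.0 ** (-F)
    qL = 2.0 ** (-F + I - L)
    K = math.log(1.0 + r ** (-v), r)
    L1 = math.log(1.0 + r ** (-v - qL / 2.0), r)
    L2 = math.log(1.0 + r ** (-v + qL / 2.0), r)
    pts = sorted([K - L2 - q0 / 2.0, K - L2 + q0 / 2.0,
                  K - L1 - q0 / 2.0, K - L1 + q0 / 2.0])
    return pts, q0, L2 - L1

def trapezoid_mean_var(e0, e1, e2, e3):
    """Mean and variance of a unit-area trapezoidal density with linear
    ramps on [e0, e1] and [e2, e3] and a flat top on [e1, e2]."""
    top = 2.0 / ((e3 - e0) + (e2 - e1))      # plateau height for unit area

    def raw_moment(n):
        # integral of (E - e0) * E**n over the rising ramp, scaled by slope
        up = ((e1 ** (n + 2) - e0 ** (n + 2)) / (n + 2)
              - e0 * (e1 ** (n + 1) - e0 ** (n + 1)) / (n + 1)) / (e1 - e0)
        flat = (e2 ** (n + 1) - e1 ** (n + 1)) / (n + 1)
        # integral of (e3 - E) * E**n over the falling ramp, scaled by slope
        down = (e3 * (e3 ** (n + 1) - e2 ** (n + 1)) / (n + 1)
                - (e3 ** (n + 2) - e2 ** (n + 2)) / (n + 2)) / (e3 - e2)
        return top * (up + flat + down)

    mean = raw_moment(1)
    return mean, raw_moment(2) - mean ** 2

if __name__ == "__main__":
    for shift in (3, 5):                     # case i and case ii for v = 0.5
        pts, q0, w = error_breakpoints(0.5, L=shift)
        print(shift, trapezoid_mean_var(*pts), (w * w + q0 * q0) / 12.0)
```

With F = 12, I = 5 and v = 0.5, a shift of L = 3 falls under case i and L = 5 under case ii, the same settings quoted for Figures 6.4 and 6.5.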


In Figures 6.4 and 6.5 the solid lines represent both
the complete analytical forms of $f_E(E)$, given by
expressions (6.27) and (6.33) for the two cases examined,
and the approximate analytical forms, described by
equations (6.29) and (6.35) respectively. The two curves
are shown to be practically identical in both cases. The
interrupted lines represent experimental curves generated
for a variety of values of the variables v, L, I, F.
Because of the similarity of these curves, only one
example is shown for each of the two distinct cases. The
experimental curves almost coincide with the theoretical
ones. For both figures the sample size was 30,000, F was
chosen to be 12, v was .5 and I was 5. For Figure 6.4, L
was 3, whereas for Figure 6.5, L was 5. These curves were
produced by dividing the values of the histograms
generated for E by the histogram step and the sample size.
The code for the theoretical form of $f_E(E)$ is listed in
Appendix H, while the plot of its theoretical
approximation is generated by the code listed in Appendix
I. Appendix J contains the code for the computer simulated
error of the logarithmic adder.
Using equations (6.31)-(6.32) and (6.37)-(6.38),
according to the case, a computer search was performed,
based on the procedure described after equation (6.5). The
results permitted the partitioning of the range of
addresses into T adjoining regions (i.e., a compact cover),
which are the address subranges for T dedicated look-up
tables. The criterion of choice was minimal overall error
variance, as described by the above formulas. The results
for optimal three- and two-level decompositions of the
address space, for various values of the base r, F, I, and
Z, are tabulated in Table 6.2. The blank spaces in this
table indicate that condition (6.7) was taken into account
and large multibit shifts were prevented from taking
place, so that the results stay within the scope of the
derived analytical formulas. The listing of the code used
to generate the entries of Table 6.2 can be found in
Appendix K. All of the above analysis holds true not only
when L ≠ 0, that is, when the ARP idea has been
implemented, but also in the special case where no
shifting of the address v takes place. The only
modification that has to be made to the formulas already
derived is to set L = 0. On the other hand, condition (6.7)
does not allow for large multibit shifts, if the
capability for quantitative prediction of the error
behavior of the LNS processor is desirable. Moreover,
large shifts require a significant increase in the number
of additional tables to be used for the $\Phi_i(v)$
mappings.
TABLE 6.2

Experimental Values For the Error Variance
for Two and Three-Level Partitioning
and Without Using ARP at all.

                              Error Variance x 10^-10
    r     F     Z     I     3-level    2-level    No Break

    2     12    8     8        4370      11100      197000
    2     12    8     7        1130       2820       49300
    2     12    8     6        3180        744       12300
    2     12    8     5                    223        3130
    2     9     8     7       72000     180000     3150000
    2     9     8     6       20300      47500      791000
    2     9     8     5                  14200        2000
    2     9     8     4                  17300       52400
    2     9     16    3                  14500       15400

This can be costly and contributes to the complexity of
the system. Also, the analysis showed that, in the manner
of a signal amplifier, it is sufficient to shift the
addresses that are very close to zero by as little as 1
bit in order to obtain a significant reduction in the
error variance. This suggests that at most 2 additional
tables (a total of three) will usually bring the error
variance down to acceptable levels.


6.3 Presorting of Operands

The preceding error analysis reveals that the maximum
accuracy gain exists whenever the absolute difference of
the two LNS operands to be added or subtracted is very
small, so that additional information can be packed into
the data through the shifting procedure. In the general
ARP-LNS case the assumption was made that the two operands
were uniformly distributed and in no degree correlated.
This was the basis for the assumption of a triangular
shape for $f_v(v)$. However, if the luxury of presorting
(for a modest investment in sorting hardware and the
latency it suggests) can be afforded by the specific
application, then $f_v(v) \to \delta_K(0)$ (Kronecker delta
function). This implies that, for a maximum allowable
address-shift of L bits, the error variance would be
reduced by a factor of $2^{-2L}$, thus resulting in major
dividends when compared to non-ARP designs.
6.4 Signed-digit ARP-LNS

The ARP design problem acquires a very interesting
dimension when alternative number systems are considered
as candidates. Such an instance is when the redundant
signed-digit (SD) system, and especially the canonical
one, is used for representing the binary LNS exponents. A
brief review of the canonical system follows.

6.4.1  Signed-Digit  Number  Systems 

Following Hwang [Hwa79], SD numbers are formally
defined in terms of a radix r and digits which can assume
the following 2α+1 values

$$\Sigma_r = \{-\alpha, \ldots, -1, 0, 1, \ldots, \alpha\}$$

where the maximum digit magnitude α must be within the
region

$$\left\lceil \frac{r-1}{2} \right\rceil \le \alpha \le r - 1$$

The original motivation for using the SD number system was
to eliminate the need for carry propagation chains in an
addition/subtraction task. To break the carry chain it is
necessary to make the lower bound on α tighter, as

$$\left\lceil \frac{r+1}{2} \right\rceil \le \alpha \le r - 1$$

The number of nonzero digits in an n-digit SD vector, say,
$Y = (y_{n-1} \ldots y_1 y_0 . y_{-1} \ldots y_{-k})_r$
with value $Y_v = \sum_{i=-k}^{n-1} y_i r^{i}$, is called
the weight $w(n, Y_v)$. In general the weight of a binary
n-digit SD vector is defined below, with $|y_i| = 1$ if
$y_i \ne 0$

$$w(n, Y_v) = \sum_{i=0}^{n-1} |y_i|$$

The SD vector with the minimal weight is called a minimal
SD representation with respect to given values of n and
$Y_v$. A minimal SD vector $D = D_{n-1} \ldots D_1 D_0$
that contains no adjacent nonzero digits is called a
canonical signed-digit vector. Reitwiesner [Rei60] showed
that there exists a "unique" canonical SD form D for any
digital number with a fixed value and a fixed vector
length n, provided the product of the two leftmost digits
in D does not equal one, that is

$$D_{n-1} \times D_{n-2} \ne 1$$

This property can always be satisfied by imposing an
additional digit $D_n = 0$ at the left end of the vector D.

A procedure to transform an (n+1)-digit binary vector
$B = B_n B_{n-1} \ldots B_1 B_0$ with $B_n = 0$ and
$B_i \in \{0, 1\}$ for $0 \le i \le n-1$, to a canonical SD
vector $D = D_n D_{n-1} \ldots D_1 D_0$ with $D_n = 0$ and
$D_i \in \{\bar{1}, 0, 1\}$, such that both vectors
represent the same value, is given below, where the carry
vector is symbolized as C.

Step 1. Start with the LSB of B. Set i = 0, $C_0 = 0$.

Step 2. Examine $B_{i+1}$, $B_i$ and $C_i$ to generate
$C_{i+1} = B_{i+1} B_i + B_i C_i + B_{i+1} C_i$

Step 3. Generate $D_i = B_i + C_i - 2 C_{i+1}$.

Step 4. Increment index i by one. Go to step 2 if
i < n. Stop otherwise.

The conversion from canonical to binary can be done by
subtracting the digit vector that results if only the
negative digits of D are considered (with the rest set to
zero) from the one that results if only the positive
digits are considered.
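
The four steps above translate almost directly into code. The Python sketch below is a minimal illustration; the function names are hypothetical, digits are returned least significant first, and the loop is run for one position beyond the n data bits (an assumption of this sketch) so that a final carry lands in the top digit.

```python
def to_canonical_sd(value, n):
    """Recode a non-negative n-bit integer into canonical signed digits
    (each -1, 0 or 1), least significant digit first."""
    if value < 0 or value >= 1 << n:
        raise ValueError("value must fit in n bits")
    # bits B_0 .. B_{n-1}, followed by B_n = 0 and one guard zero
    b = [(value >> i) & 1 for i in range(n)] + [0, 0]
    carry = 0                                   # C_0 = 0
    digits = []
    for i in range(n + 1):
        # C_{i+1} = B_{i+1} B_i + B_i C_i + B_{i+1} C_i  (majority of three)
        next_carry = (b[i + 1] & b[i]) | (b[i] & carry) | (b[i + 1] & carry)
        digits.append(b[i] + carry - 2 * next_carry)   # D_i
        carry = next_carry
    return digits

def from_canonical_sd(digits):
    """Positive-digit vector minus negative-digit vector, as integers."""
    positive = sum(1 << i for i, d in enumerate(digits) if d == 1)
    negative = sum(1 << i for i, d in enumerate(digits) if d == -1)
    return positive - negative

if __name__ == "__main__":
    print(to_canonical_sd(7, 3))    # 7 = 8 - 1
```

For example, binary 0111 is recoded as the digits 1, 0, 0, -1 (most significant first), which has weight two instead of three.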

6.4.2  Signed-digit  Addition 


The addition or subtraction of two SD numbers X and Y
is considered in the following discussion. The numbers
involved are assumed to be (n+k)-digit SD numbers

$$X = (x_{n-1} \ldots x_1 x_0 . x_{-1} \ldots x_{-k})_r \quad\text{and}\quad Y = (y_{n-1} \ldots y_1 y_0 . y_{-1} \ldots y_{-k})_r$$

The ith sum digit of the resulting sum

$$S = (s_{n-1} \ldots s_1 s_0 . s_{-1} \ldots s_{-k})_r = X + Y$$

is $s_i$, and $t_i$ is the transfer digit (carry) from the
(i-1)th digital position. The transfer digit differs from
the regular carry or borrow digits in that, unlike these,
it may assume negative values. An "interim sum" digit
$w_i$ is generated in intermediate stages.

According to Avizienis [Avi61], totally parallel
addition or subtraction restricts signed-digit
representations to radices r > 2. Furthermore, at least r+2
values are required for the sum digits $s_i$, and the
requirements for parallel subtraction would increase the
required number of subtrahend digit values to r+3 for even
radices. However, the digit addition rules may be modified
to allow for propagation of the transfer digit over two
digital positions to the left. If this two-transfer
addition is allowed, the radix r = 2 may be used and only
r+1 values are required for the sum digit. Two-transfer
addition is executed in the three successive steps shown
below (the adder diagram from Avizienis [Avi61] is shown
in Figure 6.6 and an addition example is given in Figure
6.7)


where 


1) 
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+ 
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0 
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i 

Yi  < -1 

-1 

s. 

if  ♦ t: 

= -2 
The digits take values
$x_i, y_i, s_i, t'_i, t''_i, w'_i, w''_i \in \{\bar{1}, 0, 1\}$.
From Figure 6.6 it can be seen that there is a carry
propagation over two successive digit positions.
Avizienis [Avi61] concluded that, in general, the lower
limit on the required redundancy of one digit is a
function of the number of digital positions over which a
signal is allowed to propagate. If no redundancy exists
and each sum digit assumes only r values, a sum digit
$s_i$ is a function of all the addend digits $x_i$ and the
augend digits $y_i$ to its right, i.e.,
$s_i = f(x_i, y_i, x_{i+1}, y_{i+1}, \ldots, x_m, y_m)$.
If the sum digit assumes r+1 values, then
$s_i = f(x_i, y_i, x_{i+1}, y_{i+1}, x_{i+2}, y_{i+2})$ and
the operation is the two-transfer addition. If each sum
digit assumes r+2 values or more, then
$s_i = f(x_i, y_i, x_{i+1}, y_{i+1})$ and, with the
restriction r ≥ 3, the addition is totally parallel. It
is clear that the price to be paid for totally parallel
addition is the additional storage required by the
redundancy.

FIGURE 6.6

Basic Computation Kernel for a Two-Transfer Addition

$X = (\bar{1}.01001)_2 = (-.71875)_{10}$
$Y = (0.100\bar{1}0)_2 = (.4375)_{10}$
$S = (0.0\bar{1}00\bar{1})_2 = (-.28125)_{10}$

FIGURE 6.7

Two-Transfer Addition Example

6.4.3  Canonical  ARP-LNS 

The canonical system can be employed for representing
the LNS exponents. Since all of the operations, including
multiplications and divisions, are based on additions or
subtractions and table look-ups, it is important to be
able to perform these operations as fast as possible. The
time required for any operation is reduced to the time
required for the execution of the three steps that the
two-transfer addition calls for. If the basic
computational kernel, shown in Figure 6.6, is used in a
pipeline mode, then the memory table input address
calculation may start with the most significant digit
first. According to the ARP LNS design, for an L-partition
problem, L memory tables will be candidates for access.
The generation of the first L-1 digits of the address v
will be enough for the memory chip selection, which can be
done even before the calculation of the address is
completed. This introduces additional savings that write
off the kernel computation time. In a non-pipelined mode
the computation of the address v will be carried out in an
almost parallel fashion and the whole operation will only
require the kernel computation time. Since LNS
multiplication and division are reduced to normal
addition, this presents the side advantage that the
execution time for these operations will remain basically
independent of the LNS wordlength, and the timing
considerations for a VLSI design are thus greatly
simplified. Expansions of the wordlength based on
technological advances in memory design and construction,
affecting the design of addition and subtraction, will not
affect the timing considerations for multiplication or
division. At the expense of a little more storage
hardware, accommodating the prepended MSB digit (always
zero) and the redundancy introduced by the canonical digit
representation, a very fast and modular engine has
evolved.



CHAPTER  SEVEN 

ASSOCIATIVE  MEMORY  PROCESSOR 


7 . 1 Associative  Memory  Processor  Principle 


The  established  way  to  access  a memory  table  requires 
to  store  all  items  in  the  storage  medium  where  they  can  be 
addressed  in  sequence.  The  search  procedure  is  a strategy 
for  choosing  a sequence  of  addresses,  reading  the  content 
of  memory  at  each  address,  and  comparing  the  information 
read  with  the  item  being  searched  until  a match  occurs. 
The  number  of  accesses  to  memory  depends  on  the  location 
of  the  item  and  the  efficiency  of  the  search  algorithm. 
Several  search  algorithms  have  been  developed,  trying  to 
minimize  the  number  of  accesses,  while  searching  for  an 
item  in  memory.  The  time  required  to  find  the  item  can  be 
reduced  considerably,  if  the  stored  data  can  be  identified 
for  access  by  their  content,  generally  some  specified 
subfield,  rather  than  by  their  address.  Memory  units 
accessed  by  content  are  generically  classified  as 
associative  memories  or  content  addressable  memories  (CAM) 
[Koh78].  Because  of  its  organization,  the  associative 
memory  (AM)  is  uniquely  suited  for  parallel  searches  by 
data association. Moreover, searches can be done on an
entire word or on certain fields within a word.

The idea of using AM to support the LNS look-up task
is examined in this dissertation. It can be implemented as
follows: The N-bit address field is partitioned into a set
of adjoining subfields. The subfield data of v are
presented to a section of the associative memory, which
provides a partial answer to the value of Φ(v) or Ψ(v). As
an illustration of the method, suppose that the
N-bit field of the input address v is split into two
subfields. The most significant block (MSB) consists of N_1
bits of data, with (N_1 - I) fractional bits (I being the
number of integer bits of v), while the least significant one
(LSB) consists of N_2 (= N - N_1) bits. The mapping offered
by the first module of AM as a response to MSB may belong
to one of two possible classifications:


1. Unambiguous (denoted U)

2. Ambiguous (denoted A)

There will be an ambiguous result whenever N_1 bits of
input are not enough to determine a unique mapping.
Otherwise the result will be unambiguous. Consider the
data offered in Table 7.1, where the discrete independent
variables v are mapped into discrete outputs through Φ(v).
It is taken that the final output consists of F = 5 bits,
the input consists of N = 7 bits and the dynamic range
extends from 0.4 to 0.52. It can be observed that if
N_1 = 4, there are two A states, 0111 and 1000, each resulting
in more than one output, and one U state, 0110,
resulting only in output 11010. If N_1 = 5, then again there
TABLE 7.1

Ambiguous and Unambiguous States Generated by
the Associative Memory Modules of the AMP.

Decimal v   Decimal Φ(v)   Binary v   Binary Φ(v)   States
0.398437    0.812500       0110011    11010         U
0.406250    0.812500       0110100    11010         U
0.414062    0.812500       0110101    11010         U
0.421875    0.812500       0110110    11010         U
0.429687    0.812500       0110111    11010         U
0.437500    0.812500       0111000    11010         A1 (A11)
0.445312    0.781250       0111001    11001         A1 (A12)
0.453125    0.781250       0111010    11001         A1 (A12)
0.460937    0.781250       0111011    11001         A1 (A12)
0.468750    0.781250       0111100    11001         A1 (A12)
0.476562    0.781250       0111101    11001         A1 (A12)
0.484375    0.781250       0111110    11001         A1 (A12)
0.492187    0.781250       0111111    11001         A1 (A12)
0.500000    0.781250       1000000    11001         A2 (A21)
0.507812    0.781250       1000001    11001         A2 (A21)
0.515625    0.750000       1000010    11000         A2 (A22)

N = 7, F = 5, N_1 = 4, Range: 0.4 → 0.52
are two A states, 01110 and 10000 (with fewer members
though), but more U states, and so on. The ambiguity will
have to be resolved by using the LSB of input information.
This, in addition to some information regarding the A
states encountered in the first module, will be input to
the second AM module, which will produce the final answer.
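
The U/A classification described above can be reproduced with a short sketch. It is an illustration only, under two assumptions that are consistent with the entries of Table 7.1 but are not stated in this section: all N = 7 input bits are fractional (v = code / 2^N), and Φ(v) = log_r(1 + r^(-v)) rounded to F fractional bits. The names `phi_code` and `classify` are hypothetical.

```python
import math
from collections import defaultdict


def phi_code(v_code, n_bits, f_bits, r=2.0):
    """F-bit rounded code of Phi(v) = log_r(1 + r**(-v)) (assumed form),
    where v = v_code / 2**n_bits (all input bits taken as fractional)."""
    v = v_code / 2 ** n_bits
    phi = math.log(1.0 + r ** (-v), r)
    return math.floor(phi * 2 ** f_bits + 0.5)


def classify(v_codes, n_bits, f_bits, n1):
    """Label each N1-bit prefix 'U' (one output) or 'A' (several outputs)."""
    outputs = defaultdict(set)
    for code in v_codes:
        prefix = format(code, f"0{n_bits}b")[:n1]
        outputs[prefix].add(phi_code(code, n_bits, f_bits))
    return {p: ("U" if len(o) == 1 else "A") for p, o in sorted(outputs.items())}


codes = range(51, 67)  # 0110011 ... 1000010, the inputs of Table 7.1
print(classify(codes, n_bits=7, f_bits=5, n1=4))
print(classify(codes, n_bits=7, f_bits=5, n1=5))
```

Under these assumptions, N_1 = 4 yields one U prefix (0110) and two A prefixes (0111, 1000), while N_1 = 5 yields three U prefixes and the two A prefixes 01110 and 10000.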

In general, there could be up to K R-bit-wide
associative memory modules, which will accept up to R
bits of data each. This data field would consist of N_i
bits of information from the present subfield of input and
S_{i-1} bits of information generated by the previous AM
module, as suggested in Figure 7.1. The table will export
an F-bit output U or, if that is not feasible for the
specific module, an S_i-bit-wide information word for the
next module. Each table will be able to perform an
ordinal magnitude comparison over an S_{i-1} + N_i = R_i ≤ R-bit
field and map the result into an F-bit output field. If the
decision by the associative memory is that the input
represents a U state, then the F-bit output field is
filled with the precomputed rounded value of Φ(v). A
detected A condition will cause a (dis)abling bit to be
set so as to clear the outputs of any of the AM modules
which are laid to its right. Of course, the critical issue
is how to partition N into K N_i-bit fields, i = 1, 2, ..., K.

FIGURE 7.1

General LNS Associative Memory Processor Principle

For i > 1 the individual N_i will have to share the table
address space with the S_{i-1}-bit word arriving from the
table's left neighbor. Of course, since S_{i-1} carries the
information needed for distinguishing between the A
states, the number of bits committed to S_{i-1} is determined
by the total number of A states detected at that level.
It will be equal to ⌈lg(number of A states)⌉.
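
A two-module version of this scheme can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: it assumes Φ(v) = log_2(1 + 2^(-v)) with all N input bits fractional (consistent with Table 7.1), and the dictionaries `first` and `second` merely stand in for the associative memory modules. All function names are hypothetical.

```python
import math

N, F, N1 = 7, 5, 4          # wordlengths of the Table 7.1 example
N2 = N - N1                 # bits left for the second module


def phi_code(code):
    """Assumed table content: round(log2(1 + 2**(-v)) * 2**F), v = code / 2**N."""
    v = code / 2 ** N
    return math.floor(math.log2(1.0 + 2.0 ** (-v)) * 2 ** F + 0.5)


def build_modules(codes):
    """Build both modules and the width S_1 of the word passed between them."""
    groups = {}
    for c in codes:
        groups.setdefault(c >> N2, {})[c & (2 ** N2 - 1)] = phi_code(c)
    first, second, a_states = {}, {}, 0
    for msb, by_lsb in sorted(groups.items()):
        outs = set(by_lsb.values())
        if len(outs) == 1:                       # U state: answer is final
            first[msb] = ("U", outs.pop())
        else:                                    # A state: export its index
            first[msb] = ("A", a_states)
            for lsb, out in by_lsb.items():
                second[(a_states, lsb)] = out
            a_states += 1
    s_bits = math.ceil(math.log2(a_states)) if a_states > 0 else 0
    return first, second, s_bits


def lookup(code, first, second):
    """Two-stage look-up: MSB block first, then (A index, LSB block)."""
    kind, value = first[code >> N2]
    if kind == "U":
        return value
    return second[(value, code & (2 ** N2 - 1))]


first, second, s_bits = build_modules(range(51, 67))
print(s_bits, s_bits + N2)   # S_1 and the address width of the second module
```

For the Table 7.1 data the two A states need a single bit (S_1 = 1), so the second module is addressed with S_1 + N_2 = 4 bits.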


7.2 Optimization

The discrete and rounded versions of Φ(v) and Ψ(v),
not being homomorphisms, do not allow for a prediction of
the number of U or A classes which may evolve from a
given partition. Factors that influence the cardinality
of these sets are the base r, the number of associative
modules K and the address space of these modules,
determined by R. The degree of their influence can be
estimated, and relative optimization can be achieved, only
through a numerical study using dynamic programming
principles. The two design attributes that are examined
from an optimization standpoint in this dissertation are
the speed and the complexity. Considered to be given were

• R: maximum address space for a semiconductor
     memory with cycle time t_R

• N: unsigned LNS wordlength

Although not a general rule, technology suggests that if
R_1 < R_2, then t_{R_1} < t_{R_2}. In a pipelined architecture, the
optimization target would be to minimize the largest
necessary address space of a memory table. In other words,
the function

$$\phi_1(R) = \min \left\{ \max_{i = 1, \ldots, K} \left( N_i + S_{i-1} \right) \right\}$$

should be minimized. If the goal is to minimize latency
(non-pipelined throughput), then the function

$$\phi_2(R) = \min \left\{ \sum_{i=1}^{K} t_{R_i} \right\}$$

should be optimized. The values of t_{R_i} can be obtained
from data books, or modeled as t_{R_i} = c lg(R_i), where c is
a proportionality constant and lg(R_i) reflects the
addressing overhead. Finally, if
complexity is the issue then the number K of AM modules
should be minimized. Since, in practice, K would be a
relatively small integer with R ranging from 6 to 14 bits,
an iterative search procedure could be used to optimize
the design with respect to the chosen criteria of
optimality. Summarizing, the side constraints imposed on
the optimization search should be the following:


$$\begin{array}{ll}
1. & \theta_1(K,R) = \sum_{i=1}^{K} N_i \ge N \\
2. & \theta_2(K,R) = N_i + S_{i-1} = R_i \le R \\
3. & \theta_3(K,R) = S_{i-1} \le F; \quad i > 1, \; S_0 = 0
\end{array} \qquad (7.1)$$

where again, N_i is the blocksize of the ith subfield, R is
the maximum allowable address space for the AM modules, F
is the output wordlength (only fractional bits), K is the
population of AM modules and S_i is the number of bits
exported by the ith module, determined by the number of A
classes (S_i = F, if the module offers a U answer).
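
The constraints (7.1) and the two cost functions can be checked for a candidate partition with a few lines of code. The sketch below is only an illustration of the bookkeeping; it does not perform the search itself, and the function names are hypothetical. The timing model t_{R_i} = c lg(R_i) is taken literally from the text, with R_i = N_i + S_{i-1}.

```python
import math


def check_partition(n_fields, s_prev, N, R, F):
    """Check constraints (7.1). n_fields[i] is N_i and s_prev[i] is S_{i-1}
    for module i (so s_prev[0] must be S_0 = 0). Returns (ok, widths)."""
    if len(n_fields) != len(s_prev) or s_prev[0] != 0:
        raise ValueError("need one S_{i-1} per module, with S_0 = 0")
    widths = [n + s for n, s in zip(n_fields, s_prev)]   # R_i = N_i + S_{i-1}
    ok = (sum(n_fields) >= N                   # constraint 1
          and all(w <= R for w in widths)      # constraint 2
          and all(s <= F for s in s_prev[1:])) # constraint 3
    return ok, widths


def pipelined_cost(widths):
    """Largest address space among the modules (the quantity in phi_1)."""
    return max(widths)


def latency(widths, c=1.0):
    """Sum of modeled access times t = c * lg(R_i) (the quantity in phi_2)."""
    return sum(c * math.log2(w) for w in widths)


# The N = 12, F = 9, K = 3 entry of Table 7.2.
ok, widths = check_partition([10, 1, 1], [0, 8, 7], N=12, R=10, F=9)
print(ok, widths, pipelined_cost(widths), latency(widths))
```

For this entry the largest address space equals N - K + 1 = 10, in agreement with the value reported in Table 7.2.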
7.2.1 Latency Estimation

Based on the above optimization goals, several
experiments were carried out. A numerical search was
performed to locate the minimum value for φ_1(R), for
various values of N, F, and r. The code used is listed in
Appendix L and some of the obtained values are tabulated
in Table 7.2. Analysis of these results showed that for
specified dynamic range (determined by I), fractional
wordlength, and number of AM modules, the minimum value of
φ_1(R) was always N - K + 1. For example, for K = 4 and N = 18,
it was never found to be less than 15. This minimum value
was also found to be the average value, which means that
all of the AM modules would be about equally (and rather)
slow. By investing one more bit in the
slowest table, some of the tables could be constructed
with a minimum input address length of up to three bits,
but this tradeoff does not result in any substantial
savings, while on the other hand it could complicate the
design of the processor by eliminating the existing
uniformity.

7.2.2  Tree  Structure  Experiments 

A binary  tree  structure  spawning  both  A and  U states 
can  be  employed  to  implement  the  associative  mappings.  For 
example,  the  data  found  in  Table  7.1  would  require  a tree 
like the one shown in Figure 7.2 (for N_1 = 4). Not all of
TABLE 7.2

Optimal Values of φ_1(R) for AM Processor Design
for Varying Values of N, K, F and r = 2.

Parameters           i    N_i   S_{i-1}   φ_1(R)

N=18, F=14, K=4      1    15     0
                     2     1    14         15
                     3     1    14
                     4     0    13

N=12, F=9, K=3       1    10     0
                     2     1     8         10
                     3     1     7

N=8, F=5, K=3        1     1     0
                     2     5     1          6
                     3     2     4

N=18, F=14, K=3      1     2     0
                     2    14     2         16
                     3     2    14

N=18, F=15, K=3      1    15     0
                     2     2    14         16
                     3     1    13
FIGURE 7.2

Tree-Structured AM module of AMP (see Table 7.1)

the tree nodes are necessarily occupied by switches.
Depending on whether or not the resulting states are ambiguous,
a tree node may become a leaf node even though, strictly, it
does not belong to the last level of the tree. If this is
the case, the need for a switch is eliminated. An attempt
was also made to employ a bit-serial architecture by
constructing the whole memory table via a tree structure.
Experimentation was performed regarding the amount of
hardware (in this case the switches necessary to implement
the corresponding trees). The results were obtained via
the code listed in Appendix M and are depicted in Figure
7.3. It can be seen that the number of switches increases
exponentially with the fractional wordlength. The demand
for switches becomes very large once F exceeds
13 bits.

7.2.3  Effect  of  Base  on  Clustering 

Different requirements for the input address space of
the memory table are also the result of a different choice
for the base of the LNS processor. Again, experimentation
was performed to determine the effect of different radices
on clustering of the input address space. The code used is
again the one listed in Appendix L and some of the results
are shown in Table 7.3. It can be observed that different
radices do not result in a reduced input address space,
unless there is a substantial decrease in the number of
bits used for the output mapping representation and,
FIGURE 7.3

Number of Switches versus F for Tree-Structured AM Modules
TABLE 7.3

Effect of Radix Variation on the Input Address Space
for the AM Processor

r       i    N_i   S_{i-1}   φ_1(R)

2       1     2     0
        2     5     1          6
        3     3     3

3       1     2     0
        2     6     1          7
        3     2     5

3       1     5     0
        2     3     1          6
        3     2     4

4       1     3     0
        2     4     1          5
        3     3     2

4       1     2     0
        2     5     1          7
        3     3     4

0.85    1     1     0
        2     1     1         10
        3     8     2

N = 10, F = 3, I = 4, K = 3, Range: 0 → 16.
after all, there is never more than one bit of savings.
This is not worth the extra hardware burden imposed by the
deviation from the choice of r = 2 and the reduction that
is suffered by the precision of the LNS processor itself.

7.3 Comparison of AMP and ARP

The  above  analysis,  together  with  the  one  presented 
in  Chapter  Six,  proves  the  ARP  LNS  design  to  be  superior 
to  the  AMP  one,  not  only  in  terms  of  speed  (there  is  no 
latency  for  the  ARP),  but  also  in  terms  of  actual  memory 
cells,  required  to  achieve  a certain  amount  of  precision. 


CHAPTER  EIGHT 

COMPLEXITY  OF  OPERATIONS  FOR  LNS  PROCESSORS 

8.1 Lower Bounds of Arithmetic Operations

The major advantage of LNS is its ability to perform
fast multiplication and division. Therefore, in addition
to the need for achieving the lowest possible count of
operations required for a given algorithm, it
seems plausible to try to replace a number of additions or
subtractions involved in a specific algorithm with
multiplications or divisions. An extensive survey of the
literature related to this area of optimization [Kir77,
Mor73, Mot55, Ost54, Pan63, Pan66, Pan83, Tod55, Win68,
Win70] revealed many interesting results, which will be
presented below along with some efforts of ours in the
area.

The design and analysis of arithmetic algorithms for
Discrete Fourier Transforms (DFT), Convolution of Vectors
(CV), Cyclic Convolution of Vectors (CCV), and Matrix
Multiplication (MM) have been examined by Pan [Pan83]. He
analyzed the arithmetic and logical complexity of such
algorithms and focused on the classes of (bi)linear
algorithms defined below. If I, J, K, i, j, k are nonnegative
integers, μ is a matrix, (μ)_{ij} is the entry of μ lying in
row i and column j, P is a field of constant coefficients,
X, Y, Z are vectors of indeterminates, x_i = (X)_i, y_j = (Y)_j,
z_k = (Z)_k, x_i = y_j = z_k = 0 (unless 0 ≤ i < I, 0 ≤ j < J,
0 ≤ k < K), and f_{ijk}, f_{jk} ∈ P for all i, j, k, then,
following Pan, the linear and bilinear arithmetic
computational problems are defined as


Definition 8.1 A linear problem is a set of linear forms

$$l_k(Y) = \sum_{j=0}^{J-1} f_{jk}\, y_j \quad \text{for } k = 0, 1, \ldots, K-1 \qquad (8.1)$$

Definition 8.2 A bilinear problem is a set of bilinear
forms

$$b_k(X, Y) = \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} f_{ijk}\, x_i y_j \quad \text{for } k = 0, 1, \ldots, K-1 \qquad (8.2)$$

Also a linear problem can be equivalently represented by a
bilinear form b(Y,Z), or by a matrix μ, such that

$$b(Y, Z) = \sum_{k=0}^{K-1} l_k(Y)\, z_k, \qquad (\mu)_{jk} = f_{jk} \ \text{for all } j, k \qquad (8.3)$$

Some  examples  of  linear  or  bilinear  problems  of  particular 
interest  to  Digital  Signal  Processing  are  given  below, 
where  m,  n are  nonnegative  integers. 

Problem 8.1 DFT. Discrete Fourier Transform at n+1
points.

$$b(Y, Z) = \sum_{k=0}^{n} \sum_{j=0}^{n} \omega^{jk}\, y_j z_k, \qquad (\mu)_{jk} = \omega^{jk}$$

where ω is an (n+1)th root of unity.

Problem 8.2 CV. Convolution of two vectors or polynomial
multiplication.

$$t(X, Y, Z) = \sum_{k=0}^{m+n} \sum_{j=0}^{k} x_{k-j}\, y_j z_k, \qquad (\mu(X))_{jk} = x_{k-j}$$

where x_i = 0 if i < 0 or i > m, y_j = 0 if j > n.

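Problem 8.3 CCV. Cyclic Convolution of vectors or
multiplication of two nth degree polynomials in λ,
modulo λ^{n+1} - 1.

$$t(X, Y, Z) = \sum_{k=0}^{n} \left( \sum_{j=0}^{k} x_{k-j}\, y_j + \sum_{j=k+1}^{n} x_{s-j}\, y_j \right) z_k$$

$$(\mu(X))_{jk} = x_{k-j} \ \text{if } j \le k, \qquad (\mu(X))_{jk} = x_{s-j} \ \text{otherwise}$$

Here j, k = 0, 1, ..., n, s = k+n+1.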

Problem 8.4 MM. Matrix Multiplication (I = J = K = n^2). Now
the vectors X, Y, Z are represented as the matrices
X, Y, Z, so that (X)_i = (X)_{αβ}, (Y)_j = (Y)_{βγ}, (Z)_k = (Z)_{γα},
i = αn+β, j = βn+γ, k = γn+α, and α, β, γ = 0, 1, ..., n-1.

$$t(X, Y, Z) = \sum_{\alpha,\beta,\gamma=0}^{n-1} x_{\alpha\beta}\, y_{\beta\gamma}\, z_{\gamma\alpha} = \mathrm{Tr}(XYZ)$$

$$(\mu(X))_{jk} = x_{\alpha\beta} \ \text{if } j = \beta n + \gamma,\ k = \gamma n + \alpha \ \text{for some } \alpha, \beta, \gamma; \qquad (\mu(X))_{jk} = 0 \ \text{otherwise}$$
The (bi)linear algorithms are natural ones, easy to
implement and analyze. They form the basis of the fastest
known DSP algorithms, and all other DSP algorithms can be
turned into (bi)linear ones with the total number of
operations remaining constant. For the analysis of the
arithmetic complexity of linear and bilinear algorithms,
it is technically convenient to estimate separately the
total number of additions/subtractions C(±), nonscalar
multiplications C(*), and scalar multiplications C_sc,
which are involved in the algorithm. Assume that the
indeterminates (X)_i are replaced by their values taken from
P. Then, of course, all bilinear algorithms turn into
linear ones and their C(±) does not increase. An
arbitrary (bi)linear algorithm can be written as a
sequence of linear forms in the input variables Y. Then
Pan suggests the following lower bounds for C(±):
C(±) + do(D(A)) = Ω(n lg n) for Problems 8.1-8.3 and
Ω(n^2 lg n) for Problem 8.4, with "disorder" do(D(A))
defined later in this section.

By using straightforward substitution techniques,
Ostrowski [Ost54] showed that the lower bounds on C(±)
range between K and K+Q, where K and Q are the total
numbers of input variables and linearly independent
outputs of the algorithm. This means that n additive
operations are necessary for polynomial algorithms of
degree n. These bounds cannot become the lowest possible
unless severe restrictions are imposed on the model of
computations, but then no constant can be allowed to be
greater than unity [Mor73].

Pan associates acyclic digraphs D = D(A) with the
(bi)linear algorithms A. Their vertices have outdegrees 2
and 0 and represent the ± operations of A and the input
variables of A respectively. Then C(±) is the number of
vertices of D that have outdegree 2. In order to estimate
that number, all vertices of D can be partitioned into
p(D) levels (p(D) being the "profundity" of D), whose
cardinalities are bounded from below by r (the number of
linearly independent outputs of the problem). Then
p(D) r ≤ C(±). If the desirable partition does not exist
(due to some deficiencies in the structure of the
algorithm A represented by the digraph D(A)), then C(±) is
augmented to include measures of the "irregularity" of D
(ir(D)), and "disorder" (quantitative measure of the
asynchronicity and structural deficiency of the algorithms
according to others) of D (do(D)). The terms ir(D) and
do(D) can be alternatively interpreted as measures for the
additional storage required, since the deficiencies they
represent are a result of precedence problems ([Tay83b]
p.214). Pan shows that both deficiencies can be corrected
by transforming D. This is accomplished by joining some
appropriate paths to it. It is shown that a total of
ir(D)+do(D) new vertices suffice to produce a new digraph
with no irregularity or disorder and with the same
profundity as the old one. The additive complexity has
now become C(±)+ir(D)+do(D). In the case of the four DSP
problems mentioned above, p(D) r is asymptotically
proportional to (K+Q) lg(K+Q). Finally, an estimate of
C(±) from below is offered for these problems. If r(P(X))
is the rank of an arbitrary polynomial P(X) and r(m) the
rank of any minor m of the matrix μ(X) that defines a
bilinear computational problem, then it is shown that
C(±) can be bounded from below in terms of these ranks.

A notion of rank or independence for arbitrary sets
of rational functions is developed [Kir77]. This rank
bounds from below the number of additions and subtractions
required for all straight line algorithms which compute
those functions. This permits the uniform derivation of
the lowest bounds which are known for a number of
familiar sets of rational functions. These bounds are
reported below, where D denotes now an arbitrary integral
domain, F is the quotient field of D, P is an arbitrarily
large pool of distinct indeterminates, A is any rational
algorithm and C(±) is the number of
addition/subtraction steps in A:

A1. If A computes the expression a_1 + a_2 + ... + a_n then
C(±) ≥ n-1.

A2. If A computes the expression

$$\frac{a_1 + a_2 + \cdots + a_n}{a_{n+1} + \cdots + a_{n+m}}$$

then C(±) ≥ n+m-2.

A3. If A computes the pair of expressions a_1 a_2 - a_3 a_4
and a_1 a_4 + a_2 a_3 (the real and imaginary parts of the
complex product (a_1 + a_3 i)(a_2 + a_4 i)) then C(±) ≥ 2.

A4. If A computes the general rational function

$$\frac{a_n x^n + a_{n-1} x^{n-1} + \cdots + a_0}{b_m x^m + b_{m-1} x^{m-1} + \cdots + b_0}$$

then C(±) ≥ n+m.

A5. If A computes the matrix-vector product

$$M \, (a_1, \ldots, a_n)^T$$

then C(±) ≥ N*(M) - t,
where M is a t×n matrix over D, and N*(M) denotes the
number of columns of M that are not identically zero.

A6. If A computes the matrix product AX, then
C(±) ≥ (m+p-1)(n-1), where A and X are m×n and n×p
matrices respectively.
In particular this gives addition/subtraction lower bounds
of m(n-1) for the product of an m×n matrix with an
n-vector, and 2n^2 - 3n + 1 for the product of two n×n matrices.
A7. If A computes the matrix-vector product

$$X \left[ \begin{array}{c} a_n \\ \vdots \\ a_1 \\ a_0 \end{array} \right], \quad \text{with } X = \left[ \begin{array}{ccccc} x_1^n & x_1^{n-1} & \cdots & x_1 & 1 \\ \vdots & \vdots & & \vdots & \vdots \\ x_m^n & x_m^{n-1} & \cdots & x_m & 1 \end{array} \right]$$

then C(±) ≥ n+m-1.
This means that the evaluation of an nth degree polynomial
at m arbitrary points requires at least n+m-1
additions/subtractions.

A8. If A computes the set P_1, ..., P_t with

$$P_i = \sum_{j=0}^{n(i)} a_{ij} x^j, \quad i = 1, \ldots, t, \quad \text{then } C(\pm) \ge \sum_{i=1}^{t} n(i).$$

A9. If A computes the expression

$$\sum_{i=0}^{n} \sum_{j=0}^{n} a_{ij} x^i y^j \quad \text{then } C(\pm) \ge (n+1)^2 - 1.$$
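
As a short check of the two special cases quoted under A6, the general bound (m+p-1)(n-1) can be specialized directly. This is only a worked substitution, not an additional result:

$$p = 1: \quad (m + 1 - 1)(n - 1) = m(n-1)$$

$$m = p = n: \quad (n + n - 1)(n - 1) = (2n - 1)(n - 1) = 2n^2 - 3n + 1$$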


8.2 Preconditioning of Coefficients

Motzkin [Mot53] introduced the notion of
preconditioning of the coefficients. He showed that if
operations depending only on the a_i in the course of
computing the expression

$$\sum_{i=0}^{n} a_i x^i$$

are not counted,
then only about n/2 multiplications are necessary in order
to compute that expression. The obvious application of
this result is when the same polynomial has to be computed
at many different points x. This is the case with the
inversion of a square matrix or the solution of a system
of n linear equations. It should be mentioned that the
substitution algorithms offered by Winograd reduce the
number of multiplications by increasing the number of
additions, so that the total number of basic operations
remains almost constant. This is not desirable in an LNS
environment. There, the opposite is the goal. A technique
to minimize the multiplications (by increasing the number
of additions) was first offered by Todd [Tod55]. The
reverse line of reasoning will be followed in order to
obtain LNS gains in the following example:

Assume that the polynomial

$$a_4 x^4 + a_3 x^3 + a_2 x^2 + a_1 x + a_0$$

has to be evaluated. This polynomial can be written as

$$\left[ (b_0 x) x + b_1 \right] \left[ (b_2 x + b_3) x + b_4 \right]$$

By equating the coefficients the following relations can
be found between the a's and b's:

$$\begin{array}{l}
a_4 = b_0 b_2 \\
a_3 = b_0 b_3 \\
a_2 = b_1 b_2 + b_0 b_4 \\
a_1 = b_1 b_3 \\
a_0 = b_1 b_4
\end{array}$$

Then for

$$a_2 = \frac{a_1 a_4}{a_3} + \frac{a_0 a_3}{a_1}$$

and for any choice of b_0 ≠ 0 it
can be found that

$$b_1 = \frac{a_1 b_0}{a_3}, \quad b_2 = \frac{a_4}{b_0}, \quad b_3 = \frac{a_3}{b_0} \quad \text{and} \quad b_4 = \frac{a_0 a_3}{a_1 b_0}$$

This formulation results in a scheme which requires only
three additions, while the number of multiplications
becomes five (an increase of one with regard to the
optimal Horner's rule). Notice that the forming of the
factor x^2 does not require more than a mere shift in LNS,
and that for b_0 = 1 the preconditioning results in the
savings of one addition and one multiplication. If this
polynomial evaluation has to be performed for many
different x's then the preconditioning of the coefficients
in the described way, which requires five multiplications
and one addition, will prove to be worthwhile and will
contribute to savings in speed in the long run. Similar
addition count reduction schemes can be found for other
polynomials as well, but this will have to be part of the
preparation for the execution of the specific application.

8.3 Simultaneous Computation of Polynomials

The idea of preconditioning can also be applied to
the case of computing together the values of several fixed
polynomials at the same point (real or complex) for
several values of the indeterminate. Such is the case,
for example, of approximate computation with increasing
degrees of accuracy. Such "interlacing" of preconditioning
schemes was reported two decades ago [Pan66]. The
lower bound for C(±) was given as N - s, and the lower
bound for C(*) as (N - s + 2)/2, where N is the total number of
variable and independent coefficients and s is the number
of polynomials to be simultaneously computed. Some
algorithms for the construction of such schemes have been
suggested and analyzed in the same work.
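
The fourth-degree example of Section 8.2 can be checked numerically. The sketch below is an illustration, not code from the dissertation: it solves the coefficient relations a_4 = b_0 b_2, a_3 = b_0 b_3, a_2 = b_1 b_2 + b_0 b_4, a_1 = b_1 b_3, a_0 = b_1 b_4 for the b's (which is possible only when a_2 satisfies the compatibility condition implied by those relations) and compares the factored form against Horner's rule. Exact rational arithmetic is used so that the comparison is not blurred by rounding; the function names are hypothetical.

```python
from fractions import Fraction


def horner(a, x):
    """a = [a4, a3, a2, a1, a0]: four multiplications, four additions."""
    acc = a[0]
    for coeff in a[1:]:
        acc = acc * x + coeff
    return acc


def precondition(a, b0=1):
    """Solve the coefficient relations for (b0, b1, b2, b3, b4)."""
    a4, a3, a2, a1, a0 = (Fraction(c) for c in a)
    b0 = Fraction(b0)
    if a1 == 0 or a3 == 0 or b0 == 0:
        raise ValueError("a1, a3 and b0 must be nonzero")
    if a2 != a1 * a4 / a3 + a0 * a3 / a1:
        raise ValueError("a2 does not satisfy the compatibility condition")
    b1 = b0 * a1 / a3
    return (b0, b1, a4 / b0, a3 / b0, a0 / b1)


def evaluate(b, x):
    """Factored form: five multiplications, three additions."""
    b0, b1, b2, b3, b4 = b
    return ((b0 * x) * x + b1) * ((b2 * x + b3) * x + b4)


a = [3, 4, 11, 8, 10]            # built from b = (1, 2, 3, 4, 5)
b = precondition(a)
print(b, evaluate(b, 2), horner(a, 2))
```

In an LNS setting the trade is attractive because the extra multiplication costs one fixed-point addition, while the saved addition avoids a table look-up.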


CHAPTER  NINE 
LNS  DSP  APPLICATIONS 


The basic LNS unit, whether mechanized as an ARP or
not (AMP designs will not be considered further since
they are inferior to the ARP ones), will be capable of
performing a limited number of important DSP algorithms in
a Reduced Instruction Set Computer (or RISC-like)
environment. In either case, throughput will be limited by
the speed of its components. For comparative purposes,
several commercially available floating-point processors
of short wordlength are compared in Table 9.1. Using the
Schottky TTL (2500LS) design as a reference and parts from
the same semiconductor family plus a 4K 35ns ROM for
parameters, the estimated performance of the LNS and
LNS-ARP is reported in Table 9.2. It can be noted that for
add/subtract intensive operations, conventional
floating-point and LNS processor latencies are comparable. However,
in a real multiply intensive environment, a significant
advantage is enjoyed by the LNS. This advantage becomes
more obvious when performing complex arithmetic.

The entire procedure for the development of
applications can be systematized by using the following
steps:
TABLE 9.1

Comparison of Commercially Available Short-Wordlength
Floating-point Processors

Operation   INTEL 8087*   WEITEK WTL 1032-5       AMD-MSI 2500LS**    Sky 2910-based
            (ms)          32-bit, t_clk = 100 ns  16-bit mantissa     32-bit
                          (ns)                    (ns)                (ns)

ADD         .019          910                     74-111 (derived)    5300
SUBTRACT    .140          910                     74-111 (derived)    8500
MULTIPLY    .021          910                     160 (derived)       5500
DIVIDE      .159          N/A                     N/A                 15000
SQRT        .159          N/A                     320                 66370

*  Based on 10 sequentially called executions
   on single precision data
** Fast increment/decrement = 10ns
   Fixed point: add = 37ns, multiply = 150ns
TABLE 9.2

Throughput Estimates for Various Basic DSP Operations
Using Conventional and LNS Processors

          Operation   A(1) Conventional         B(2) LNS                  A/B
                      latency (ns)              latency (ns)

REAL      ADD, SUB    111                       2(37)+45 = 119            0.933
          MULT        160                       37                        4.324
          DIV         320 (est)                 37                        8.649
          SQRT(3)     320                       10 (S/R)                  32.000
          SQUARE      160                       10 (S/R)                  16.000
          RMS(4)      320+160L+111(L-1)+320     37L+119(L-1)+10+37        ~1.737
                      = 271L + 529              = 156L - 58
          FIR(5)      111(L-1)+160L             119(L-1)+37L              ~1.737
                      = 271L - 111              = 156L - 119

COMPLEX   ADD, SUB    222                       239                       0.929
          MULT        4(160)+222 = 1262         4(37)+239 = 386           3.269
          ABS         2(160)+111+320 = 751      2(10)+119+10 = 149        5.040

(1) Based on survey tables
(2) Based on a 45ns HMOS ROM/RAM
(3) Model: 2 x FLP multiplication delay
(4) RMS calculation: $\left[ \frac{1}{L} \sum_{i=1}^{L} x_i^2 \right]^{1/2}$
(5) FIR computation: $\sum_{i=1}^{L} a_i x_i$
Step 1) Precompile the application algorithm to
maximize  the  multiplication/addition  tradeoff. 
Here,  techniques  introduced  in  Chapter  Eight, 
including  rearrangement  of  equations  and 
preconditioning  of  coefficients  can  be  used. 

Step  2)  Extract  maximum  parallelism  in  computations, 
through  the  use  of  the  mathematical  models  for 
precedence  relations. 

Step  3)  Compile  and  execute  a simulation  program  to 
verify  and  accumulate  statistics  and  resource 
requirements.  Such  a simulation  should  determine, 
among  others,  the  percentage  of  calls  of  specific 
memory  tables,  corresponding  to  certain  subranges 
of  an  ARP  design. 

Step  4)  Based  on  the  information  gathered  in  step  3, 
configure  the  system,  including  the  optimal 
resources  and  execute  the  application. 

The  LNS  has  already  been  shown  to  be  a viable  tool  in 
implementing  FFTs  and  linear  shift-invariant  filters. 
Several  basic  operations  required  for  digital  filter 
applications  are  suggested  in  Table  9.2,  where  an  Lth 
order  FIR  and  sums  of  products  algorithms  are  compared. 
However,  an  attempt  to  showcase  some  of  the  more 
interesting  non-obvious  DSP  applications  of  the  LNS  will 
be  made  below. 
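
The FIR entries of Table 9.2 follow from a simple count of L multiplications and L - 1 additions. The sketch below is illustrative only; it uses the unit delays quoted in Table 9.2 (111 ns and 160 ns for the conventional add and multiply, 119 ns and 37 ns for the LNS add and multiply), and the function name is hypothetical.

```python
# Unit delays in nanoseconds, as quoted in Table 9.2.
T_ADD_FLP, T_MUL_FLP = 111, 160   # conventional floating point
T_ADD_LNS, T_MUL_LNS = 119, 37    # LNS


def fir_latency(L, t_add, t_mul):
    """Latency of an L-term sum of products: L multiplies, L - 1 adds."""
    return (L - 1) * t_add + L * t_mul


for L in (8, 64, 1024):
    flp = fir_latency(L, T_ADD_FLP, T_MUL_FLP)   # equals 271L - 111
    lns = fir_latency(L, T_ADD_LNS, T_MUL_LNS)   # equals 156L - 119
    print(L, flp, lns, round(flp / lns, 3))
```

As L grows, the ratio approaches 271/156, which is the value of about 1.737 listed in the table.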
9.1 Multiple Input Adder

Many  DSP  operations  are  intrinsically  sums  of 
products  statements.  Therefore,  there  may  be  some 
advantage  in  combining  derived  partial  product  terms  with 
a tree-adder.  The  LNS  components  developed  to  this  point 
accept  a maximum  of  two  operands.  Since  add  times  are 
dominant  in  LNS,  it  is  desirable  to  explore  potentially 
faster  multi-operand  adders.  Consider  a three  operand 
adder  configured  to  add  S = A + B + C.  Then 

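$$s \leftarrow a + \mathrm{lr}\left(1 + r^{b-a} + r^{c-a}\right) = a + \mathcal{S}[(b - a), (c - a)] \qquad (9.1)$$

with $\mathcal{S}(x, y) = \mathrm{lr}\left(1 + r^{x} + r^{y}\right)$.

The lookup address space would have to be sufficiently
large so as to accommodate the concatenated address b-a,
c-a. An alternative expression for s is the following:

$$s \leftarrow a + \mathrm{lr}\left(1 + r^{\,b-a+\Phi(b-c)}\right) \qquad (9.2)$$

or

$$s \leftarrow a + \Phi\left[a - b - \Phi(b - c)\right] \qquad (9.3)$$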

Such an adder can be realized easily in LNS, since it only
requires the very same table used for regular two-operand
addition. Some extra routing hardware has to be provided,
especially to cover the case of a subtraction being a
sub-operation. The basic architecture of such a
three-operand addition is shown in Figure 9.1 for a
non-pipelined operation. If the hardware has to be
optimized, simple rerouting suffices to make multiple use
of the same hardware elements, such as memory tables. The
advantage of this architecture over the two-operand
addition is that one magnitude comparison time is saved.
The price for that lies in the extra address bit normally
required by the second table, but even this drawback could
be alleviated by appropriately adjusting the dynamic range
of the LNS, to exploit the fact that the output of the
$\Phi$ table is always bounded by unity.

FIGURE 9.1

Three-Operand LNS Adder
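
The nested use of the two-operand table in (9.3) can be checked numerically. The following is a minimal sketch, not the hardware design: it assumes radix r = 2, positive operands, and the table convention $\Phi(v) = \mathrm{lr}(1 + r^{-v})$, which is the convention under which (9.2) and (9.3) agree. The table is evaluated in floating point instead of being read from a ROM, and all function names are hypothetical.

```python
import math

R = 2.0  # assumed radix r of the logarithmic number system


def lr(x):
    """Logarithm to the radix R (the document's lr)."""
    return math.log(x) / math.log(R)


def phi(v):
    """Two-operand addition table, assumed to be Phi(v) = lr(1 + r**(-v))."""
    return lr(1.0 + R ** (-v))


def lns_add2(a, b):
    """Exponent of A + B from exponents a and b (positive operands only)."""
    hi, lo = (a, b) if a >= b else (b, a)
    return hi + phi(hi - lo)


def lns_add3(a, b, c):
    """Exponent of A + B + C following (9.3), after sorting so a >= b >= c."""
    a, b, c = sorted((a, b, c), reverse=True)
    # The inner table output lies in (0, 1] for r = 2, so the outer
    # argument a - b - phi(b - c) can be as low as -1.
    return a + phi(a - b - phi(b - c))


if __name__ == "__main__":
    s = lns_add3(lr(3.0), lr(5.0), lr(7.0))
    print(R ** s)  # close to 15
```

Because the inner table output never exceeds unity, the argument of the outer table never falls below -1, which is where the extra address bit for the second table comes from.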

9.2 CORDIC Replacement

By use of a prescribed sequence of conditional additions
or subtractions, the CORDIC equations can be used to
approximately solve either of the following sets of
equations:

$$Y' = K (Y \cos\lambda + X \sin\lambda), \qquad X' = K (X \cos\lambda - Y \sin\lambda) \qquad (9.4)$$

or

$$R = K (X^2 + Y^2)^{1/2}, \qquad \theta = \tan^{-1}(Y/X) \qquad (9.5)$$


where K is an invariable constant. Though the concept of
CORDIC arithmetic is said to be quite old [Vol59], its
implementations and applications continue to evolve,
especially in areas like computer graphics and analysis
[Dag59, Hav80]. The acronym comes from Volder's COordinate
Rotation DIgital Computer, developed in 1959 for air
navigation and control instrumentation. In 1971 Walther
[Wal71] elegantly generalized the mathematics of CORDIC,
showing that the implementation of a wide range of
transcendental functions can be fully represented by a
single set of iterative operations. All operations are
based on the execution (in a multiplier-free digital
system) of either


$$x_{i+1} = x_i + y_i\, 2^{-i} \qquad (9.6)$$

or

$$x_{(i+1),2} = \left(1 + \gamma\, 2^{-i}\right) x_{(i+1),1}$$


While  the  first  of  these  equations  represents  the  regular 
(Rotation  and  Vectoring)  CORDIC  operations,  the  second 
forces  the  scale  factors  of  circular  and  hyperbolic 
functions  to  unity.  ROM  based  instructions  govern  the 
selection  of  either  the  first  or  the  second  equation.  The 
architecture  suggested  by  (9.6)  is  multiplier-free  and 
uses  only  elementary  shifts  or  adds.  It  implies  a variable 
execution  time  ranging  from  a very  short  delay  to  one  of 
the  order  of  F shift  and  add  delays,  where  F is  the  number 
of  bits  of  the  dataword. 

For a system with a fast multiplier, such as the LNS,
the direct calculation of the variables is shown to be
faster, more accurate and more regular from a data flow
aspect. The generation of polar from rectangular
coordinates (Vectoring) is considered. Then the equations
(9.5) have to be computed (for K = 1). The efficient
generation of R in the LNS has already been reported in
Table 9.2, while the production of $\theta$ can be
accomplished through the use of a series expansion of the
form


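$$\theta = \tan^{-1}(S) = \begin{cases} \dfrac{\pi}{2} - \dfrac{1}{S} + \dfrac{1}{3S^3} - \dfrac{1}{5S^5} + \cdots ; & S > 1 \\[2ex] S - \dfrac{S^3}{3} + \dfrac{S^5}{5} - \dfrac{S^7}{7} + \cdots ; & |S| < 1 \\[2ex] -\dfrac{\pi}{2} - \dfrac{1}{S} + \dfrac{1}{3S^3} - \dfrac{1}{5S^5} + \cdots ; & S < -1 \end{cases} \qquad (9.7)$$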


All of the exponentiations $S^{j}$ and the divisions by
$j$, $j = 1, 3, 5, \ldots$, which make up about two thirds
of the total computational operations count, are very
efficiently and accurately performed in LNS by simple
shifts and additions. Therefore, the LNS polynomial
implementation of the transcendental functions is a viable
and exciting alternative to CORDIC vectoring. On the other
hand, as Appendix A suggests, the hyperbolic trigonometric
functions can be elegantly implemented in LNS, resulting
in the same kind of savings. The architecture of a
vectoring processor is abstracted in Figure 9.2 and the
LNS procedure for polynomial evaluation of the angle
$\theta$, where $|S| < 1$ and the length of the power
series is L, is given in Table 9.3. This computational
procedure can be microcoded and invoked during run-time.
A latency estimate is also computed, based on the
assumption that the time $T_{A/S}$ required for addition
or subtraction is
[Figure 9.2 block annotations: $R = (x^2 + y^2)^{1/2}$; $\theta$ = power expansion of equation (9.7)]

FIGURE 9.2

LNS CORDIC Architecture
TABLE 9.3

LNS Arctan-Vectoring Procedure (see Figure 9.2)

Time    Operation        Control 1   P1         c_i         Control 2     P2
0       Subtract y-x     0           s          0           N/A           0
1       Iterate          N/A         3s         c_3         N/A           0
2       for              .           3s         c_3         1             s
5       angle            .           3s         c_3         -1            P2 + (3s + c_3)
8       .                .           5s         c_5         1             P2 + (5s + c_5)
3L-6    .                N/A         (2L+1)s    c_(2L+1)    (-1)^(L-2)    P2 + (-1)^L [(2L-3)s + c_(2L-3)]
3L-3    .                1           2x         0           (-1)^L        P2 + (-1)^(L+1) [(2L-1)s + c_(2L-1)]
3L      *                2           2y         *           1             P2 + (-1)^(L+3) [(2L+1)s + c_(2L+1)]
3L+1    Output θ         N/A         N/A        .           1             P2 + (-1)^(L+3) [(2L+1)s + c_(2L+1)]
3L+3                     N/A         N/A        .           1             2x
3L+6                     N/A         N/A        .           1             2x + 2y
3L+7    Output R         N/A         N/A        "           1             2x + 2y
        via S/R
three times the time $T_{M/D}$ required for multiplication
or division. If the number of significant terms in (9.7)
that are chosen to satisfy a prespecified precision metric
is L, then the developed latency model yields a latency of
$(3L + 6)\,T_{M/D}$ for the vectoring operation. The same
operation performed using an 8088/8087 processor pair
would require about 17 multiply cycles to compute the
angle and an additional 10.7 cycles to compute the radius;
a total of 20.7 multiply cycles. It can be observed that
the LNS approach is faster and more precise than the
8088/8087 pair. Experimental data found in Table 9.4,
produced using the code found in Appendix N, support the
above analysis by showing that the relative error in the
computation of the angle and the radius by polynomial
expansion using LNS is significantly smaller than the one
resulting from an FXP implementation of the CORDIC
technique.
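
The log-domain evaluation of the middle branch of (9.7) can be sketched as follows. This is an illustrative floating-point model under stated assumptions, not the microcode of Table 9.3: radix r = 2, positive X and Y with Y < X, the table conventions $\Phi(v) = \mathrm{lr}(1 + r^{-v})$ and $\Psi(v) = \mathrm{lr}(1 - r^{-v})$, and constants $c_j = -\mathrm{lr}(j)$. The function names are hypothetical.

```python
import math

R = 2.0  # assumed radix


def lr(x):
    return math.log(x) / math.log(R)


def phi(v):
    """Assumed addition table: lr(1 + r**(-v))."""
    return lr(1.0 + R ** (-v))


def psi(v):
    """Assumed subtraction table: lr(1 - r**(-v)), valid for v > 0."""
    return lr(1.0 - R ** (-v))


def lns_arctan(x, y, terms):
    """Exponent of theta = arctan(Y/X) from exponents x, y, with 0 < Y < X.

    `terms` plays the role of L, the number of series terms kept.
    """
    s = y - x                # the division Y/X is a single subtraction
    p2 = s                   # accumulator starts with the first term S
    for k in range(1, terms):
        j = 2 * k + 1
        t = j * s - lr(j)    # exponent of S**j / j, i.e. j*s + c_j
        # Alternating series: odd k subtracts, even k adds.
        p2 = p2 + (psi(p2 - t) if k % 2 else phi(p2 - t))
    return p2


def lns_radius(x, y):
    """Exponent of R = sqrt(X**2 + Y**2): doubling, one LNS add, one halving."""
    hi, lo = max(2 * x, 2 * y), min(2 * x, 2 * y)
    return 0.5 * (hi + phi(hi - lo))


if __name__ == "__main__":
    print(R ** lns_arctan(lr(4.0), lr(1.0), 8), math.atan(0.25))
```

Every power and every division by j reduces to scaling or offsetting an exponent, so only the accumulation steps touch the lookup tables. This is consistent with the latency model growing as 3L.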


9.3 Pseudo Wigner-Ville Distribution

The Wigner-Ville Distribution (WVD) [Cla80, Fla84a,
Vil48] and its evolved version, the Pseudo-WVD (PWVD),
are very good candidates when a time-frequency
representation of a non-stationary signal is needed.
Historically, the Short Term Fourier Transform (STFT)
[All77] has been the main analysis tool for studying
signals with time-varying spectra. The STFT is premised on
the so-called "constant frequency assumption." Therefore,
TABLE 9.4

Comparison of Errors in Transforming the Rectangular
Coordinates to Polar Ones, Using CORDIC and Through
Polynomial Expansion Using FXP and LNS Respectively

              Relative Error x 10^-6
      Angle θ                   Radius R                Fractional Bits
FXP      LNS    CORDIC      FXP      LNS    CORDIC      FXP    LNS    CORDIC
27560    876    350359      20550    314    285570      9      4      4
4369     705    19543       3932     75     6798        12     7      7
887      705    3266        556      64     618         12     10     10
835      705    1703        212      64     297         12     11     11
686      705    908         139      64     147         12     12     12

The Relative Error is computed as a noise-to-signal ratio,
obtained by subtracting the results of the CORDIC technique
and the FXP and LNS polynomial expansion values from the
ones obtained using polynomial expansion with
floating-point precision, and then dividing the
differences by the latter.
the applied time-domain window must be sufficiently short
over this interval for the spectral signature to remain
constant. A more general class of signals is the one of
linear frequency (i.e., $\partial\phi(t)/\partial t = c_0 + c_1 t$).
This class of signals can be efficiently processed using
the PWVD and a Gaussian window (which will force the
Wigner quasi probability density function to be
nonnegative). In the discrete case, the Wigner
distribution is given by

$$W(j, k) = 2 \sum_{n=-N/2}^{N/2} x(j, n)\, w(n)\, w^{*}(-n)\, W^{nk} \qquad (9.8)$$

where w(n) is a real window function, $W = \exp\{-2\pi j/N\}$ and
x(j, n) stands for the Wigner computational kernel
$x(j+n)\, x^{*}(j-n)$. Willey et al. [Wil84] proposed a systolic
architecture for the implementation of (9.8) for the
unwindowed case. It can be observed that the PWVD is a
complex-multiply-intensive statement: forming the Wigner
kernel (windowed or not) is a massively
complex-multiplier-intensive task. The data exhibited in
Table 9.2 support the throughput advantage to be gained by
LNS complex multipliers. To produce a Wigner kernel would
require N complex multiplications, N real multiplications
and N complex additions. Using the data from Table 9.2, it
is found that the PWVD kernel can be created in the
following times (in nanoseconds) for a conventional FLP
system and LNS respectively:
Conventional: N(1262 + 320 + 222) = 1804 N

LNS: N(386 + 74 + 239) = 699 N

In other words, LNS is approximately 2.6 times faster
(1804/699 ≈ 2.58) than conventional FLP processors. Of
course, the produced kernel can be processed by a standard
N-point FFT. The latter has been studied, for the LNS
case, by Swartzlander et al. [Swa83].

Recently, Flandrin et al. [Fla84b] published a
variation on this theme by considering the signal to be
analytic. More specifically, the real-valued signal x(n)
in (9.8) is mapped through a Hilbert transform into the
series $z(n) = x(n) + j x_H(n)$, where $x_H$ is the
Hilbert-transformed signal defined (discrete case) by

$$x_H(n) = \sum_{m \neq n} x(m)\, \frac{\sin[\pi(m - n)/2]}{\pi(m - n)/2} \qquad (9.9)$$

If the temporal data extend by ±M delays about some sample
t, and if an N-point harmonic spectrum is desired, then its
Wigner signature is given by

$$W(t, k) = 4\, \mathrm{Re}\left\{ \sum_{i=0}^{N-1} W^{ik}\, |h_N(i)|^2 \sum_{m=-M+1}^{M-1} g_M(m)\, z(t+m+i)\, z^{*}(t+m-i) \right\} - 2\, |z(t)|^2 \qquad (9.10)$$
where $h_N$ and $g_M$ are frequency and temporal windows
respectively. The formula (9.10) allows for obtaining a
2N-point PWVD through an N-point FFT. A simplistic
systolic WVD (not considering a window function) has been
reported [Rab74]. Flandrin proposed a conventional
architecture for implementing (9.10) using a TMS320
processor. A more elegant and faster systolic
architecture, based on the developed LNS units, is
presented in Figure 9.3, where for clarity only the case
N = 3, M = 3 is presented. Systolic primitives are utilized
to implement the mapping s ← s + ab required by the
multiplier-intensive operations. This architecture is
completely modular and general, in the sense that it
accepts any desirable definition of the window functions
h(i), g(i). The implementation of the second window is
particularly exciting. The architecture in Figure 9.3 is
fast and fully utilizes the attributes of the LNS
processor. The LNS approach offers as a side advantage the
ability to alter the shape of certain window functions
without any need for reprogramming. The preferred window
function is the Gaussian window, mentioned above, and
given by

$$G(i) = c\, e^{-i^2/a^2} \qquad (9.11)$$

where c and a are design parameters. The LNS-Wigner
processor would be presented with the value of
[Figure 9.3 annotations: N = M = 3; data explicitly shown for z = 0; all data paths are complex LNS two-tuples R + jI = (r, i); $s_{p/2} \leftarrow s_{p/2} + g(i)\, v_p(i)$, p = 0, 2, 4; 2 real binary adders; 2 real LNS adders]

FIGURE 9.3

LNS Systolic Wigner Processor Architecture
$$g(i) = \mathrm{lr}[G(i)] = \mathrm{lr}(c) + m(i); \qquad m(i) = -\frac{i^2}{a^2}\, \mathrm{lr}(e) \qquad (9.12)$$

If a is reassigned a value a' = da; d > 0, then the new
window weight is given by

$$g'(i) = g(i) + m(i)\, \frac{1 - d^2}{d^2} \qquad (9.13)$$

If d is selected in such a way that

$$d = \left\{ \frac{m(i)}{(2^{\pm k} - 1)\, g(i) + m(i)} \right\}^{1/2}$$

for any k, then the new window is obtained by a simple
shift of the old weight by k bits.
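
The reshaping of the Gaussian window in the log domain can be illustrated with a short numerical sketch. It assumes radix r = 2 and uses floating point in place of LNS hardware; the function names are hypothetical. It checks that the update term $m(i)(1 - d^2)/d^2$ reproduces the window obtained with a' = da, and that for the special case c = 1 with $d = 2^{-k/2}$ the new weight is the old one scaled by $2^k$, that is, a k-bit shift of the exponent word.

```python
import math

R = 2.0  # assumed radix


def lr(x):
    return math.log(x) / math.log(R)


def m(i, a):
    """Log-domain shape term of the Gaussian window: -(i**2 / a**2) * lr(e)."""
    return -(i * i) / (a * a) * lr(math.e)


def g(i, a, c):
    """Log-domain window weight lr(c * exp(-i**2 / a**2))."""
    return lr(c) + m(i, a)


def g_rescaled(i, a, c, d):
    """New weight after a is replaced by d * a, built from the old weight."""
    return g(i, a, c) + m(i, a) * (1.0 - d * d) / (d * d)


if __name__ == "__main__":
    # General case: agrees with recomputing the window for a' = d * a.
    print(g_rescaled(2, 3.0, 2.0, 1.5), g(2, 4.5, 2.0))
    # Special case c = 1, d = 2**(-k/2): the new weight is 2**k times the old.
    k = 2
    print(g_rescaled(2, 3.0, 1.0, 2.0 ** (-k / 2)), (2 ** k) * g(2, 3.0, 1.0))
```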

9.4 Multiplicative FIR Filters

The linear Finite Impulse Response (FIR) filters play
an important role in contemporary DSP. They are proven to
be very effective in implementing linear phase digital
filters. Linear phase behavior can be ensured by
satisfying some straightforward coefficient symmetry
conditions. However, compared to an Infinite Impulse
Response (IIR) filter, an equivalent FIR will be of much
greater order. Rabiner et al. [Rab74] have published
empirical results which indicate that the order of an FIR
can be several orders of magnitude greater than that of
an equivalent IIR (having a similar ideal frequency
response). In other words, the enhanced frequency-phase
characteristics of the FIR filter are obtained at the cost
of degraded complexity-throughput performance. Low
complexity would require a few-multiplier (probably
time-multiplexed) tap coefficient generation, while
advanced throughput requires many concurrently operating
multipliers.

Recently, an alternative design methodology was
proposed by Fam [Fam81] and extended by Taylor [Tay84].
Known as the Multiplicative FIR filter (MFIR), it is based
on the use of the identity

$$\sum_{i=0}^{2^P - 1} x^i = \prod_{i=0}^{P-1} \left(1 + x^{2^i}\right) \qquad (9.14)$$

Replacing x by the quantity $Az^{-1}$ (z is the usual
z-transform variable), the following FIR transfer function
results:

$$\sum_{i=0}^{2^P - 1} \left(Az^{-1}\right)^i = \prod_{i=0}^{P-1} \omega_i(A, z) \qquad (9.15)$$

where $\omega_i(A, z) = 1 + (Az^{-1})^{2^i}$. The impulse
response of a system described by equation (9.15) is of
exponential nature, given by the time series
$\{x(j)\} = \{A^j\}$, for j assuming nonnegative integer
values. The synthesis of a more general impulse response
would require many concurrently operating MFIR stages. The
potential advantage of (9.15) is that the operations count
is much reduced. The conventional way requires $2^P$
coefficient scalings and $2^P - 1$ additions, whereas the
MFIR requires
[Figure 9.4 panels: CONVENTIONAL ARCHITECTURE; LNS ARCHITECTURE]

FIGURE 9.4

LNS-MFIR Architecture
only P of those. An MFIR-LNS architecture is proposed in
Figure 9.4, where a = lr(A). The LNS is well suited for
this case, because it can also take advantage of the fact
that the necessary forming of powers (i.e., $A^k$;
$k = 2^i$) can be done through elementary data shifts of
the LNS exponent of A. Taylor [Tay84] has offered a design
of an arbitrary transfer function by using a set of MFIRs
connected in parallel and satisfying an optimization
criterion. The architecture in Figure 9.4 is seen to
possess a regular and modular form, suitable for VLSI
design.
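
The factorization behind the MFIR can be verified with a few lines of code. The sketch below is illustrative only (hypothetical function names, radix 2 assumed, A > 0): it expands the cascade of P sections $1 + (Az^{-1})^{2^i}$ into its tap list and compares it with the $2^P$ direct-form taps $A^j$. Each section coefficient is formed from the exponent a = lr(A) by multiplying it by $2^i$, which in hardware is a left shift of the exponent word.

```python
import math


def direct_taps(A, P):
    """Impulse response of the conventional 2**P-tap filter: A**j."""
    return [A ** j for j in range(2 ** P)]


def mfir_taps(A, P):
    """Impulse response of the cascade prod_i (1 + (A z^-1)**(2**i))."""
    a = math.log2(A)                  # LNS exponent of A (radix 2, A > 0)
    taps = [1.0]
    for i in range(P):
        step = 2 ** i
        coeff = 2.0 ** (a * step)     # A**(2**i): exponent a scaled by 2**i
        out = [0.0] * (len(taps) + step)
        for n, t in enumerate(taps):
            out[n] += t               # contribution of the "1" branch
            out[n + step] += coeff * t  # delayed and scaled branch
        taps = out
    return taps


if __name__ == "__main__":
    print(mfir_taps(0.5, 3))
    print(direct_taps(0.5, 3))
```

For P = 3 the cascade uses three coefficient scalings to produce the same eight taps that the direct form obtains with eight.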

9.5 Echo Cancellation

Echo in telecommunications is undesirable and
unavoidable [Gri84]. It is the result of impedance
mismatch problems of the energy couplers between the
four-wire circuit of the satellite communications link and
the two-wire local circuit. Therefore, some energy on the
satellite branch is coupled to the local branch and
returned to the source as an echo. This echo (typically
11 dB down from the original signal) requires special
treatment, usually provided by an echo canceller
[Wid75, Son67]. The concept of echo cancellation, "To
remove the echo subtract it," sounds suspicious to the
uninitiated. It is based on mimicking the echo mapping
function in the echo canceller device, which is connected
in parallel to the echo path and synthesizes a replica of
the echo, which is then subtracted from the combined echo
and near-end speech signal to obtain the near-end signal
alone. The identification of the mapping function of the
echo path is mostly done through an adaptive filter of one
form or another, which gradually matches the impulse
response of the actual echo path. The principal algorithm
in use today for echo cancellation (mainly because of its
simple structure and implementation) is the
Least-Mean-Squares or LMS algorithm [Wei79, Wid76]. The
LMS echo canceller is shown in Figure 9.5, where {x(n)} is
the far-end signal source, {y(n)} is the signal on the
return path (which includes echo and the near-end signal
{v(n)}), {ŷ(n)} is its estimated version and {e(n)} is the
error signal that serves as feedback to the adaptive
filter. The samples of the above waveforms at sampling
instant kT are given by x(k), y(k) and e(k) respectively.
K is the loop gain or step size of the echo canceller, N
is the number of taps in the echo canceller and h(i, k),
i = 0, ..., N-1, are the tap weights. The feedback signal
of the echo canceller is given by


$$\begin{aligned} e(k) &= y(k) - \hat{y}(k) \\ &= y(k) - \sum_{i=0}^{N-1} x(k-i)\, h(i, k) \\ &= y(k) - \sum_{i=0}^{N-1} x(k-i) \left[ h(i, k-1) + 2K\, x(k-1-i)\, e(k-1) \right] \end{aligned} \qquad (9.16)$$
FIGURE 9.5

General Model of Echo Canceller
By adjusting the loop gain K, the problems of variation in
far-end signal power and double-talk can be addressed.
It can be seen that this algorithm is very simple and
requires only O(2N) multiplications and O(2N) additions
per iteration for an N-tap filter. This makes it an ideal
choice if the results of different implementations of a
general algorithm via several arithmetic systems are to be
considered in a balanced operation environment. Some
experiments were performed along this line, through
simulation, according to the source codes found in
Appendix O. An experiment was held for each one of the
floating-point, fixed-point and logarithmic echo canceller
implementations. The far-end signal x(k) was assumed to
be Gaussian white noise with a mean of zero and limited
between $\pm 2^p$, with p being defined interactively. The
return signal was supposed to be given by the product
A x(k - delay), where the constant A and the delay were
again defined interactively. The criterion of comparison
was the rate of convergence of the LMS algorithm,
determined by the Mean Square Error (MSE). For an ensemble
size of L, the MSE is given by

$$\mathrm{MSE}(k) = \frac{1}{L} \sum_{i=1}^{L} e_i^2(k) \qquad (9.17)$$

where $e_i(k)$ is offered by equation (9.16). In Figure 9.6
the variable MSE(k) versus time is plotted. The fractional
FIGURE 9.6

Convergence Curves for Three Echo Canceller Implementations:
a) Floating-point, b) Fixed-point and c) LNS.
Ensemble Size was 24
wordlengths used in the experiments were 16 - p - 1 for the
FXP canceller and 16 - lr(p) - 2 for the LNS one, so that
they would cover equivalent dynamic ranges. The values of
the parameters involved were K = 0.0125 and A = 0.3; the
number of samples was 300, the number of taps N = 50, and
the ensemble size L = 24. It can be observed that the
64-bit accurate FLP system (solid line) performs slightly
better than the other two, if it is assumed that all three
systems require the same time for the execution of the
operations involved. According to Table 9.2, though, this
is not the case. In absolute time units the LNS echo
canceller proves to converge at least 1.72 and 1.20 times
faster than its FLP and FXP counterparts, respectively.
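
The floating-point version of this experiment can be sketched as follows. This is not the code of the appendix: it is a minimal reconstruction of the LMS recursion of (9.16) and the ensemble average of (9.17), using the parameter values quoted above. The delay, the noise standard deviation `sigma`, the clipping exponent `p` and the random seed are assumptions chosen only to make the example self-contained.

```python
import random


def lms_mse(K=0.0125, A=0.3, delay=5, taps=50, samples=300,
            ensemble=24, sigma=0.5, p=0, seed=1):
    """Ensemble-averaged squared error of a floating-point LMS echo canceller.

    delay, sigma, p and seed are illustrative assumptions, not source values.
    """
    rng = random.Random(seed)
    limit = 2.0 ** p
    mse = [0.0] * samples
    for _ in range(ensemble):
        # Far-end signal: zero-mean Gaussian noise clipped to +/- 2**p.
        x = [max(-limit, min(limit, rng.gauss(0.0, sigma)))
             for _ in range(samples)]
        h = [0.0] * taps
        for k in range(samples):
            window = [x[k - i] if k - i >= 0 else 0.0 for i in range(taps)]
            y = A * x[k - delay] if k >= delay else 0.0   # return signal
            y_hat = sum(hi * xi for hi, xi in zip(h, window))
            e = y - y_hat
            # Tap update used at the next instant: h += 2 K x e.
            for i in range(taps):
                h[i] += 2.0 * K * window[i] * e
            mse[k] += e * e / ensemble
    return mse


if __name__ == "__main__":
    curve = lms_mse()
    print(sum(curve[:50]) / 50, sum(curve[-50:]) / 50)
```

A fixed-point or logarithmic variant would replace the multiplications and the accumulation with the corresponding quantized operations while leaving the loop structure unchanged.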


CHAPTER  TEN 

IMPACT  OF  DESIGN  ON  ARCHITECTURE 


Whereas the basic LNS unit is architected with
internal (on-chip) memory, it is unrealistic to assume
that a useful ARP-LNS device could be architected with
multiple high-density large ROM arrays on the same chip.
Most probably it will be architected with external memory,
data paths and a central control, communication and
computation chip, or $C^5$ unit. It will also contain a
certain amount of fast cache memory. The $C^5$ architecture
is shown in Figure 10.1, where the multiple internal paths
support the following DSP operations, using the indicated
resources:

Operations             Resources
Multiply/divide        Adder
Add/subtract           Controller (1), Near-Zero PLA (2),
                       Comparator, Adder, ARP External memory
Square/Square root     Shift Register

(1) The ARP controller also checks for essential zeros.
(2) Near-zero Φ or Ψ mappings are "many to few."
[Figure 10.1 label: LOOKUP ADDRESS DEVICE]

FIGURE 10.1

$C^5$ Architecture


The external memory included in the ARP design
presents some outstanding opportunities as well as some
hazards when configured as two or three-dimensional memory
architectures.

If permitted by the specific application and its data
flow, the $C^5$ architecture could allow for an LNS
addition/subtraction scheme which would be free from most
memory dependencies. This can be done by first performing
all the non-additive operations, converting the results
into a different number system (FLP, for example, which
does not require any memory support) and using efficient
operation-structuring methods, like trees, to perform fast
additions. If required, the results could then be
converted back into LNS. Of course, this addition scheme
can be justified only if the length of the addition
sequence is large enough to write off the conversion
overhead.
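
The table-free addition scheme described above can be sketched in a few lines. This is an illustration under simplifying assumptions (radix 2, positive operands, floating point standing in for the intermediate FLP format, hypothetical function names): the LNS exponents are converted once, summed with a pairwise tree, and the single result is converted back.

```python
import math

R = 2.0  # assumed radix


def tree_sum(values):
    """Pairwise (tree-structured) summation of a non-empty sequence."""
    vals = list(values)
    while len(vals) > 1:
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])      # odd element is carried to the next level
        vals = nxt
    return vals[0]


def lns_sum_via_linear(exponents):
    """Exponent of the sum of positive LNS operands, without lookup tables."""
    linear = [R ** e for e in exponents]          # LNS -> linear, once per operand
    return math.log(tree_sum(linear)) / math.log(R)  # linear -> LNS, once


if __name__ == "__main__":
    print(R ** lns_sum_via_linear([0.0, 1.0, 2.0, 3.0]))  # close to 15
```

The two conversions are paid once per operand and once per result, which is why the scheme only pays off for long addition sequences.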

10.1  Two-Dimensional  Memory  Architectures 

A $C^5$-ARP would communicate, on demand, with one of a
possibly large number of chips. This communication can be
accomplished  using  a variety  of  possible  networking 
schemes.  At  the  high  end  of  switching  complexities  are 
cross-bar  networks.  At  the  lower  end  are  various  networks 
[Fen81]  like  cubic  rings,  gamma  networks,  etc.  Still, 
methods  based  on  packet  switching  [Dia81]  may  be  used.  In 
a Dance  Hall  architecture,  shown  in  Figure  10.2,  switches 
are  used  to  arbitrate  memory  tables  to  the  ARP  engines. 
FIGURE 10.2

Dance Hall Architecture

The memory part consists of a total of KT chips organized
along rows of identically programmed ROMs. Each row is
responsible for LNS table look-ups from one of K ARP
intervals. Within each row there are T repetitions of the
same table. During run-time, an ARP processor performing
an addition that requires table look-up support will
access one of the KT chips by issuing a select command.
This command will have to specify a row (according to the
ARP interval), enable an available chip within that row
and finally present an address to the enabled chip. It is
possible to define the number of memory rows and the
number of identical tables within each row statistically,
so that there is an optimum utilization of these tables.
For example, for a three-partition case, an ARP would
perform an addition by accessing one of the three memory
rows, corresponding to the partition levels, according to
the addition address. If the three tables are called
$\mu_1$, $\mu_2$ and $\mu_3$ and if experimentation has
determined that 60 percent of the calls are for $\mu_1$,
30 percent for $\mu_2$ and 10 percent for $\mu_3$, then for
a KT = 10 chip design, the most reasonable assignment
would be the one allocating 6 chips for the first row, 3
for the second and 1 for the third one, so that the number
of memory calls would match statistically the number of
available resources. It is also reasonable to assume (even
without experimentation) that the number of tables
required to be dedicated to the same address subrange will
decrease as the subrange approaches the essential zero,
especially if presorting of the addresses has already
taken place. For optimal utilization (as mentioned in the
third step of the procedure for preparation of
applications given in Chapter Nine), the total number KT
of tables can be determined by using simulation to
accumulate statistics regarding the application.
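
The statistical assignment of chips to rows can be expressed as a small allocation routine. The sketch below is one possible policy, not the author's procedure: it assumes that the simulation delivers a call count per table and distributes the available chips proportionally, using the largest-remainder rule to handle counts that do not divide evenly. The function name is hypothetical.

```python
def allocate_chips(call_counts, total_chips):
    """Split total_chips among table rows in proportion to observed calls."""
    total_calls = sum(call_counts)
    # Integer part of each row's proportional share.
    base = [c * total_chips // total_calls for c in call_counts]
    # Remainders decide which rows receive the chips still unassigned.
    rema = [c * total_chips % total_calls for c in call_counts]
    leftover = total_chips - sum(base)
    order = sorted(range(len(call_counts)), key=lambda i: rema[i], reverse=True)
    for i in order[:leftover]:
        base[i] += 1
    return base


if __name__ == "__main__":
    # Call statistics of 60, 30 and 10 percent with ten chips available.
    print(allocate_chips([60, 30, 10], 10))
```

A production policy would probably also guarantee at least one chip per row, since a row with no chip cannot serve its address subrange at all.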

10.2  Three-Dimensional  Memory  Architectures 

Because  of  the  regularity  of  the  ARP  machine  and  the 
memory-data  flow,  the  issue  of  integration  can  be  taken 
to  the  next  higher  level.  The  previously  developed  memory 
architecture  is  basically  planar  (two-dimensional).  What 
is  now  possible  is  to  develop  the  3-D  system  suggested  in 
Figure  10.3.  Besides  communicating  in  a traditional  plane 
(card  level),  multiple  and  regular  paths  are  opened 
through  plural  intracard  connectors  in  the  third  dimension 
(3-D  space  level).  These  paths  have  a bandwidth  far  in 
excess  of  that  obtainable  by  using  standard  edge 
connectors. Using this architecture, effective nearest
neighbor techniques can be implemented for ring and cube
networks.
[Figure 10.3 annotations: The processor cards are regular in their data flow and architecture. They can be sandwiched into a multi-plane system. Data are bussed over ribbons which connect adjacent ARPs to their nearest neighbor cells. Nearest neighbor search rules will be used for address assignment.]

FIGURE 10.3

$C^5$ ARP LNS Architecture

FIGURE 10.4

Intraboard Communications of 3-D Architectures
via Cell Interfaces.
10.3 Fault Tolerance

Globally, fault tolerance can be designed into the
system by using a number of Error Correcting Codes (ECC)
techniques and redundancy. However, an additional fault
tolerance dimension can be achieved by attaching the $C^5$
units to a multidimensional memory array such as the one
shown in Figure 10.4. It can be seen there that a dual-port
memory is associated with each processor, with one of its
components being its primary choice for table look-ups,
while the other one (belonging to an adjacent cell) is its
secondary choice. The following fault-overcoming scheme
can be employed. At the memory level, if an access to a
memory table fails, data can be moved around the periphery
of the memory table to a neighbor. If a dual-port memory
(primary) fails, look-up operations can be assigned to its
adjacent table (secondary) or passed through the
complementing cell processor to its tables, or exported to
an adjacent cell.

It is obviously desirable to be able to design a
system which can be rearchitected during run-time. The
key to such a design is the ability to populate it with
identical processor chips. Reconfiguration can be used for
either rearchitecting a system for efficient execution of
tasks or overlaying a degree of fault tolerance on a
design. Instead of following the traditional notion of
replacing faulty components in-kind or allowing for
graceful degradation, one could try to explore another
philosophy, borrowing elements from biology:

"The ideal (or at least an excellent)
reconfiguration system is the human body. If she
or he loses a subsystem (e.g., sight) then the
body will try to replace it in-kind (e.g., tissue
regeneration) and lacking this ability, it will
rewire the system to use the information of the
functioning subsystems more efficiently (e.g.,
increased reliance on hearing)."

That is, instead of trying to replace the lost components
with reallocated hardware, determine if this hardware
reserve may be better used by increasing the power of a
distinctly different subsystem. For example, if the
memory data or address lines fail, the $C^5$ can still be
of use by labeling it as supporting, say, only
multiplication, division, squaring and other
non-memory-intensive operations, instead of trying to
force it back to a complete instruction set processor.

10.4  Memory  Assignment  Schemes 

Several  memory  assignment  schemes  are  candidates  for 
employment.  Among  them  are  a stack  pointer  approach  (on- 
chip)  and  a nearest  neighbor  one  that  would  assign  to  the 
LNS processor the "closest" memory chip. The first
approach (on-chip stack and microcontroller) allows for
considering a novel method for performing branching or
servicing interrupts.

Suppose that a task is being processed until an event
occurs which requires the temporary suspension of that
task and the execution of a "new task" to commence.
Classical wisdom would suggest that the state of the
processor be saved for the old task, the new task be
loaded and executed, and finally the old task be restored
and its execution continued. However, based upon mathematical
models (like the precedence relations) and the fact that
the ARP design will probably provide for an on-chip
microcontroller, it is possible to conceive a policy that
would allow for transfer of control from the old to the
new task and back again with a minimum of real-time delay.
This could prove to be a critical objective in
designing programmable signal processing systems, where
the algorithm remains constant but the source of the input
time series and the coefficient set can be defined by
conditional branching tests or external control. In a
multi-ARP system, one could start with the longest delay
path to the output and complete that calculation while the
next system set-up values are preloaded (probably by using
doubly-buffered on-chip memory blocks) into the ARPs
located at that level. Once the data exported from this
longest delay path are assimilated into the next level,
that stage is preloaded with data from the interrupting
task, and the process continues. Using this method, it
would appear as though valid outputs could continuously
flow from the processor. In other words, this interrupt
servicing technique introduces no apparent latency at the
output.

Using  the  ARP,  fast  compact  DSP  systems  can  be 
designed,  which  will  be  able  to  operate  over  a large 
dynamic  range  on  a low  error  budget.  Many  of  these  designs 
will  comprise  a number  of  identical  and  interconnected 
ARPs.  These  processor  chips  will  have  to  share  data.  This 
sharing  can  take  place  in  globally  assigned  shared  memory, 
or  through  processor  to  processor  communication.  In 
general, for a design involving p processors (with p
roughly in the range 2 < p < 32), being able to fetch
vectors of arbitrary length n from either local or shared
global memory, Gannon et al. [Gan84] found that the
execution time of dyadic operations is

$$ r_\infty^{-1} \left( n + n_{1/2}^{G} \right) $$

if either operand is in global memory, while it is

$$ r_\infty^{-1} \left( n + n_{1/2}^{L} \right) $$

if both operands are in local memory. Here $r_\infty$ is the
asymptotic performance rate for a one-processor design and
$n_{1/2}$ is the vector length required to achieve half the
asymptotic performance rate, assuming that local memory
accesses have much less latency than global memory ones
($n_{1/2}^{L} \ll n_{1/2}^{G}$). It is obvious that the throughput can be
increased by just including a sufficient number of
registers in the design. Several other memory assignment
schemes could also be considered as candidates.
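
The timing model quoted from [Gan84] can be made concrete with a few lines of C. This is only a minimal sketch, assuming the execution time has the form (n + n_half) / r_inf, with n_half standing for the half-performance vector length of whichever memory (local or global) holds the operands. The function names `dyadic_time` and `effective_rate` are hypothetical and do not come from the source.

```c
#include <assert.h>

/* Hypothetical helper (not from the source): execution time of a dyadic
 * vector operation of length n under the model t = (n + n_half) / r_inf.
 *   r_inf  : asymptotic performance rate (results per unit time)
 *   n_half : vector length needed to reach half of r_inf; use the
 *            global-memory value if either operand is global,
 *            the local-memory value if both operands are local. */
double dyadic_time(double n, double n_half, double r_inf)
{
    assert(r_inf > 0.0 && n >= 0.0 && n_half >= 0.0);
    return (n + n_half) / r_inf;
}

/* Effective rate n / t; it equals r_inf / 2 when n == n_half. */
double effective_rate(double n, double n_half, double r_inf)
{
    assert(n > 0.0);
    return n / dyadic_time(n, n_half, r_inf);
}
```

For a fixed vector length, a smaller n_half (local operands) gives a shorter time, which is the motivation for keeping operands in registers or local memory.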


CHAPTER  ELEVEN 
CONCLUSIONS 


The  research  conducted  in  the  framework  of  this 
dissertation  advanced  the  body  of  available  knowledge 
about  logarithmic  number  systems  and  arithmetic  processors 
based  on  them. 

Some  new  results  are  reported  with  regard  to  the 
execution  of  arithmetic  operations  in  LNS.  Emphasis  was 
given  to  computation  of  trigonometric  functions. 

The  choice  of  a specific  base  of  logarithms  for  LNS 
was  proven  to  be  immaterial  for  a criterion  based  on  the 
maximum  allowable  error  for  a given  dynamic  range  and 
wordlength.

The  problem  of  conversion  to  and  from  LNS  was 
considered  and  a full-scale  error  analysis  for  the  FLP  to 
LNS  conversion  was  provided.  This  analysis  served  as  a 
basis  for  a subsequent  stochastic  analysis  of  the 
proposed  advanced  LNS  processor  designs,  which  are 
characterized  by  enhanced  precision.  Two  such  designs,  the 
ARP  and  AMP,  were  examined.  They  were  analyzed  in  terms 
of  error  budget,  operational  latency  and  amount  of 
hardware  involved.  They  were  also  compared  to  each  other. 
The best of the two designs (ARP) proved to offer a
precision enhancement equivalent to a wordlength up to
50 percent longer than that of conventional
contemporary LNS designs. Alone or integrated with other
number systems (like the Signed-Digit Canonical system,
whose effect on the ARP performance was studied in Chapter
Nine), the ARP was used as a basis to form new
processor designs, based on memory sharing techniques.
Because of its modularity, simplicity of hardware and
regularity of the data flow, it enabled the construction
of 3-D systems, where multiple and regular paths are
opened in the third dimension. Such designs facilitate
communications by increasing the bandwidth and offer
alternative solutions to the issues of fault tolerance and
recovery.

The  theoretical  bound  for  the  operations  count  of 
several  DSP  algorithms  was  targeted  for  the  LNS  processors 
and some innovative techniques were offered to approach
it.

An applications development procedure was offered,
which, by way of summarizing the findings of this research,
aspires to be optimal. Also, some LNS applications were
examined and some new solutions were given to old
problems (like that of computing trigonometric
functions).

A 20-bit  LNS  VLSI  processor  has  been  made  feasible  as 
a result  of  this  research. 

Some questions remain open, however, and call for
future research. For example, the new
processor could form the spine of a highly parallel system
architecture. While, as an initial step, research should be
directed towards the design of dedicated numerically intensive
machines for such applications as inner products and
matrix transformations, it could later be expanded to a
more general purpose setting. Since programs written for a
von Neumann machine are in general ill-suited to a
parallel multi-processor engine, the software needs to be
redesigned in order to achieve high throughput and high
processor utilization factors.

It will be of fundamental value to determine precisely
the effect of adding cache memory to the ARP design, and
to specify optimally the size of this cache and the I/O
requirements it imposes.

A determining factor for the successful performance of
the LNS engine developed, for either the 2-D or 3-D case,
will be the presence of a predictive compiler, which will
be capable of looking at the code (past, present and
future) for intrinsic parallelism, the processor-memory
(local or globally shared) availability, and the status of
the network switches (busy, idle, faulty) imposed by
communication link restrictions, and produce an
automatically generated (AUTOGEN) code which will maximize
the throughput. Part of this compiler's tasks will be the
optimization of the memory trafficking, the establishment
of precedence relations and the minimization of interrupt
latency.

Although some of the above research tasks can be
undertaken by means of analytical mathematical
models, several of them can be carried out
only through simulation. The end result would be a
VLSI processor floorplan, layout and simulation
of an ARP system and, if applicable, fabrication of the design.

Using  the  VLSI  LNS  processors,  a wide  range  of 
problems  in  general  purpose  scientific  computing,  DSP, 
robotics  and  so  forth  can  be  positively  impacted.  In 
particular,  the  developed  architectures  could 
significantly  enhance  real  time  adaptive  filtering. 


APPENDIX A

DERIVATION  AND  IMPLEMENTATION  OF  TRIGONOMETRIC 
AND  OTHER  FUNCTIONS  FOR  LNS 


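Adopting the LNS representation for a real number
$Z = S_z r^z$, introduced in Chapter Three, a variable X can be
defined as

$$ X = z \ln(r) = \frac{z}{\mathrm{lr}(e)} $$

so that $e^X = r^z$.

A.1 Hyperbolic Trigonometric Functions

For $H = r^h = \cosh(X) = \frac{e^X + e^{-X}}{2} = \frac{e^X (1 + e^{-2X})}{2}$

$$ h = \mathrm{lr}(e^X) + \mathrm{lr}(1 + e^{-2X}) - \mathrm{lr}(2) $$
$$ \;\; = z + \mathrm{lr}(1 + r^{-2z}) - \mathrm{lr}(2) $$
$$ \;\; = z + \Phi(2z) - \mathrm{lr}(2) $$

For $M = r^m = \sinh(X) = \frac{e^X - e^{-X}}{2} = \frac{e^X (1 - e^{-2X})}{2}$

$$ m = \mathrm{lr}(e^X) + \mathrm{lr}(1 - e^{-2X}) - \mathrm{lr}(2) $$
$$ \;\; = z + \mathrm{lr}(1 - r^{-2z}) - \mathrm{lr}(2) $$
$$ \;\; = z + \Psi(2z) - \mathrm{lr}(2) $$

For $T = r^t = \tanh(X) = \frac{\sinh(X)}{\cosh(X)} = \frac{M}{H} = r^{m-h}$

$$ t = \Psi(2z) - \Phi(2z) $$

For $C = r^c = \coth(X) = \frac{\cosh(X)}{\sinh(X)} = \frac{H}{M} = r^{h-m}$

$$ c = \Phi(2z) - \Psi(2z) $$

For $A = r^a = \mathrm{sech}(X) = \frac{1}{\cosh(X)} = \frac{1}{H} = r^{-h}$

$$ a = -z - \Phi(2z) + \mathrm{lr}(2) $$

For $V = r^v = \mathrm{csch}(X) = \frac{1}{\sinh(X)} = \frac{1}{M} = r^{-m}$

$$ v = -z - \Psi(2z) + \mathrm{lr}(2) $$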

The architecture for hardware implementation of the above
functions is already given in Figure 3.4. However, for
clarity, the algorithms for cosh(X) and coth(X) are
summarized below.

cosh(X)

1) Produce z
2) Shift z to the left by one bit
3) Present 2z to the table Φ
4) Add z + Φ(2z) - lr(2) -> h

coth(X)

1) Produce z
2) Shift z to the left by one bit
3) Present 2z to the tables Φ and Ψ
4) Subtract Φ(2z) - Ψ(2z) -> c

The regularity of the data flow is obvious.

A.2 Other Functions

Multiplication

For $C = A \times B$

$$ r^c \leftarrow r^a \times r^b, \qquad c \leftarrow a + b $$

Division

For $C = A \div B$

$$ r^c \leftarrow \frac{r^a}{r^b}, \qquad c \leftarrow a - b $$

Without loss of generality we can assume that A > B. Then

Addition

For $S = A + B$

$$ r^s = r^a + r^b = r^a \left( 1 + r^{b-a} \right) $$
$$ s \leftarrow a + \Phi(b - a) \quad \text{with} \quad \Phi(b - a) = \mathrm{lr}\left( 1 + r^{b-a} \right) $$

Subtraction

For $\Delta = A - B$

$$ r^\delta = r^a - r^b = r^a \left( 1 - r^{b-a} \right) $$
$$ \delta \leftarrow a + \Psi(b - a) \quad \text{with} \quad \Psi(b - a) = \mathrm{lr}\left( 1 - r^{b-a} \right) $$

For the special case when A = B, then $\Delta \leftarrow 0$.


APPENDIX  B 

PROGRAM  FOR  SIMULATION  OF  THE  FLP  TO  LNS  ENCODER 


/*********************************************************
*                                                        *
* This program performs a simulation for the Error       *
* Analysis at the output of the Logarithmic Encoder.     *
*                                                        *
* Total is the Sample size (30000 ?)                     *
* seed1 & seed2 are random #s generator's initializers.  *
* The final result is the Histogram's value divided by   *
* (TOTAL * step) where step = (high - low) / HISTO_AXIS  *
* high and low define the dynamic range                  *
* frac and frac1 define the fractional wordlength        *
* at the input and output of the log encoder table.      *
*                                                        *
**********************************************************/


#include <stdio.h>
#include <math.h>

#define SIZE 40000

int HISTO_AXIS;
main()
{
    extern int HISTO_AXIS;
    int total;
    int i, low, high, seed1, seed2;
    double temp, low1, frac1, frac, high1;
    double rad, k, Quant1, A[SIZE], Quant;
    double alpha[SIZE], error[SIZE], y[SIZE], L[SIZE];
    double alpha1[SIZE], error1[SIZE], y1[SIZE], L1[SIZE];
    char nam[20];
    short rand(), nfrom();

    printf("Enter HISTO_AXIS :\n");
    scanf("%d", &HISTO_AXIS);
    printf("Enter rad, total Quant1 bits Quant2 bits:\n");
    /* the targets are doubles, so %lf is required */
    scanf("%lf %d %lf %lf", &rad, &total, &frac, &frac1);
    printf("Enter seed1, seed2 :\n");
    scanf("%d %d", &seed1, &seed2);
    printf("rad = %f total = %d and frac = %e\n",
           rad, total, frac);

    Quant = pow(2., - frac);
    Quant1 = pow(2., - frac1);
    printf("Quant = %e Quant1 = %e\n",
           Quant, Quant1);

    (void) srand(seed1);
    low = 5000;
    high = 10000;

    for (i = 1; i <= total; i++){
        alpha[i] = (double)(nfrom(low, high)) / (double) high;
        L[i] = log(alpha[i]) / log(rad);
    }

    (void) srand(seed2);
    for (i = 1; i <= total; i++){
        alpha1[i] = (double)(nfrom(low, high)) / (double) high;
        L1[i] = log(alpha1[i]) / log(rad);
    }

    /* Histogram is performed below
       to check the uniformity of alpha */
    low1 = .5;
    high1 = 1.;
    histo(low1, high1, total, "alpha", alpha);

    /* Histogram is performed below
       to check the distribution of L */
    low1 = log(.5) / log(rad);
    high1 = 0.;
    histo(low1, high1, total, "L", L);

    for (i = 1; i <= total; i++){
        alpha[i] = floor(0.5 + alpha[i] / Quant) * Quant;
        y[i] = log(alpha[i]) / log(rad);
    }

    for (i = 1; i <= total; i++){
        alpha1[i] = floor(0.5 + alpha1[i] / Quant) * Quant;
        y1[i] = log(alpha1[i]) / log(rad);
    }

    /* Histogram is performed below
       for alpha* */
    low1 = 0.5;
    high1 = 1.;
    histo(low1, high1, total, "alpha *", alpha);

    /* Histogram is performed below
       for y = lr(a) */
    low1 = log(low1) / log(rad);
    high1 = log(high1) / log(rad);
    histo(low1, high1, total, "y", y);

    for (i = 1; i <= total; i++){
        y[i] = floor(0.5 + y[i] / Quant1) * Quant1;
        error[i] = y[i] - L[i];
    }

    /* Histogram is performed below
       for the discrete values of y */
    low1 = low1 - Quant1 / 2.;
    high1 = high1 + Quant1 / 2.;
    histo(low1, high1, total, "y *", y);

    /* Histogram is performed below
       to define the p.d.f. of the error */
    low1 = log(1. - Quant) / log(rad) - Quant1 / 2.;
    high1 = log(1. + Quant) / log(rad) + Quant1 / 2.;
    histo(low1, high1, total, "error", error);
}

/* The function nfrom returns a random number
   between low and high inclusive */

short nfrom(low, high)
register short low, high;
{
    short rand();
    register short nb = high - low + 1;

    return (rand() % nb + low);
}

/* Histogram function
   In the calling function the following should be
   specified : a) HISTO_AXIS : < 104 */

#define CHARSIZE 20

histo(low, high, total, name, aray)
int total;
double low, high;
char name[CHARSIZE];
double aray[SIZE];
{
    extern int HISTO_AXIS;
    int i, j, count[110];
    double step, fit, m;
    char star, star1;
    FILE *fp;               /* fopen and fclose are declared in <stdio.h> */

    fp = fopen("outlogrm", "w");
    star = ' ';
    star1 = 'o';

    printf(" Histogram for %s \n", name);
    for (j = 0; j <= HISTO_AXIS + 5; j++){
        /* 5 stands for a nicer table */
        count[j] = 0;
    }

    step = (high - low) / HISTO_AXIS;
    printf("step = %f\n", step);
    for (j = 1; j <= total; j++){
        /* fabs, not abs: low is a double */
        if (aray[j] < low - (fabs(low) / 1.e+06))
            printf(" Low FIT for histogram %s = %f\n",
                   name, aray[j]);
        else if (aray[j] > high)
            printf(" High fit for histogram %s = %f\n",
                   name, aray[j]);
        else
            for (i = 1; i <= HISTO_AXIS; i++){
                if ((aray[j] < (low + i * step)) &&
                    (aray[j] >= (low + (i - 1) * step)))
                    count[i] = count[i] + 1;
            }
    }

    fit = 0.;
    for (i = 1; i <= HISTO_AXIS; i++){
        fprintf(fp, "%d %e\n", i, count[i] / (total * step));
        if (fit < count[i])
            fit = count[i];
    }
    fclose(fp);

    printf(" Low = %e total= %d High = %e fit=%f\n",
           low, total, high, fit);
    for (i = 1; i <= HISTO_AXIS; i = i + 5)
        printf("count[%d] = %d %d %d %d %d\n",
               i, count[i],
               count[i+1], count[i+2], count[i+3], count[i+4]);
    for (i = 1; i <= HISTO_AXIS; i++){
        printf(" %d", i);
        for (m = 1; m <= 75. * (count[i] / fit); m++)
            printf("%c", star);
        printf("%c\n", star1);
    }
}


APPENDIX  C 

CODE  FOR  GENERATION  OF  THEORETICAL  CURVE  OF  THE 
p.d.f.  OF  THE  LOG  ENCODER  ERROR.  CASE  i) 


/*********************************************************
*                                                        *
* This program computes the theoretical p.d.f.           *
* for the ERROR E at the output of the                   *
* FLP to LNS Encoder.                                    *
*                                                        *
* It passes its output to the file 'outfE'               *
* to be used for plotting purposes.                      *
*                                                        *
* Accepts the base (radix), Fractional wordlength (frac) *
* and the Total number of points to assume and computes  *
* the values for the p.d.f. of E                         *
*                                                        *
* It assumes that there is one bit more at the output    *
* of the mapping memory table.                           *
*                                                        *
**********************************************************/


#include <stdio.h>
#include <math.h>

#define arraysize 1000
main()
{
    int j, k;
    double i, frac, high, low, step, power, a[arraysize];
    double temp, rad, Q, total, Q1, lnr;
    char star, star1;
    FILE *fp;               /* fopen and fclose are declared in <stdio.h> */

    fp = fopen("outfE", "w");
    star = ' ';
    star1 = 'o';

    printf("Enter radix, frac, total :\n");
    /* the targets are doubles, so %lf is required */
    scanf("%lf %lf %lf", &rad, &frac, &total);

    Q = pow(2., - frac);
    Q1 = pow(2., - (frac + 1.));
    lnr = log(rad);

    low = log(1. - Q) / lnr - Q1 / 2.;
    high = log(1. + Q) / lnr + Q1 / 2.;
    printf(" low = %f high = %f\n", low, high);
    step = (high - low) / total;

    j = 1;
    for (i = low; i <= (high + step); i += step){
        if ((i > log(1. - Q) / lnr - Q1 / 2.) &&
            (i <= log(1. - Q) / lnr + Q1 / 2.)){
            a[j] = (Q / (4. * Q1));
            a[j] *= (1. / (1. - pow(rad, i + Q1 / 2.))
                     - 1. / Q);
            a[j] -= (1. / (4. * Q * Q1)) *
                    (pow(rad, i + Q1 / 2.) + Q - 1.);
        }
        else if ((i > log(1. - Q) / lnr + Q1 / 2.)
                 && (i <= log(1. - Q / 2.) / lnr - Q1 / 2.)){
            a[j] = Q / (4. * Q1);
            a[j] *= 1. / (1. - pow(rad, i + Q1 / 2.))
                    - 1. / (1. - pow(rad, i - Q1 / 2.));
            a[j] -= (1. / (4. * Q * Q1)) *
                    (pow(rad, i + Q1 / 2.) - pow(rad, i - Q1 / 2.));
        }
        else if ((i > log(1. - Q / 2.) / lnr - Q1 / 2.)
                 && (i <= log(1. - Q / 2.) / lnr + Q1 / 2.)){
            a[j] = (Q / (4. * Q1)) * (2. / Q - 1. / (1. - pow
                   (rad, i - Q1 / 2.))) - (1. / (4. * Q * Q1))
                   * (1. - Q / 2. - pow(rad, i - Q1 / 2.));
            a[j] += (3. / (4. * Q * Q1)) *
                    (pow(rad, i + Q1 / 2.) - 1. + Q / 2.);
        }
        else if ((i > log(1. - Q / 2.) / lnr + Q1 / 2.)
                 && (i <= log(1. + Q / 2.) / lnr - Q1 / 2.)){
            a[j] = (3. / (4. * Q * Q1)) * (pow
                   (rad, i + Q1 / 2.) - pow(rad, i - Q1 / 2.));
        }
        else if ((i > log(1. + Q / 2.) / lnr - Q1 / 2.)
                 && (i <= log(1. + Q / 2.) / lnr + Q1 / 2.)){
            a[j] = (3. / (4. * Q * Q1)) * (- pow
                   (rad, i - Q1 / 2.) + 1. + Q / 2.);
            a[j] += (Q / (4. * Q1)) * (2. / Q + 1. / (1.
                    - pow(rad, i + Q1 / 2.))) - (1. / (4. * Q * Q1))
                    * (-1. - Q / 2. + pow(rad, i + Q1 / 2.));
        }
        else if ((i > log(1. - Q / 2.) / lnr + Q1 / 2.)
                 && (i <= log(1. + Q) / lnr - Q1 / 2.)){
            a[j] = (Q / (4. * Q1)) * (1. / (1. - pow(rad,
                   i + Q1 / 2.)) - 1. / (1. - pow(rad, i - Q1 /
                   2.))) - (1. / (4. * Q * Q1)) * (pow(rad, i
                   + Q1 / 2.) - pow(rad, i - Q1 / 2.));
        }
        else if ((i > log(1. + Q) / lnr - Q1 / 2.)
                 && (i <= log(1. + Q) / lnr + Q1 / 2.)){
            a[j] = (Q / (4. * Q1)) * (- 1. / Q - 1. / (1.
                   - pow(rad, i - Q1 / 2.))) - (1. / (4. * Q * Q1))
                   * (1. + Q - pow(rad, i - Q1 / 2.));
        }
        printf(" a[%d] at %f = %e\n", j, i, a[j]);
        fprintf(fp, "%d %e\n", j, a[j]);
        j++;
    }
    fclose(fp);

    /* scale the curve to at most 75 columns and print it;
       j++ replaces the undefined "j = j++" */
    temp = 0;
    for (j = 1; j <= total; j++)
        if (a[j] > temp) temp = a[j];
    if (temp >= 75.)
        for (j = 1; j <= total; j++)
            a[j] = 75. * (a[j] / temp);
    for (j = 1; j <= total; j++){
        printf("%d", j);
        for (k = 1; k <= a[j]; k++)
            printf("%c", star);
        printf("%c\n", star1);
    }
}


APPENDIX  D 

CODE  FOR  GENERATION  OF  THEORETICAL  CURVE  OF  THE 
p.d.f.  OF  THE  LOG  ENCODER  ERROR.  CASE  ii) 


/*********************************************************
*                                                        *
* This program computes the theoretical p.d.f.           *
* for the ERROR E at the output of the                   *
* FLP to LNS Encoder.                                    *
* Accepts the base (radix), Fractional wordlength (frac) *
* and the Total number of points to assume and computes  *
* the values for the p.d.f. of E                         *
* It assumes that there is an equal number of bits       *
* available at the input and output                      *
* of the mapping memory table.                           *
* It passes its output to the file 'outfE='              *
* to be used for plotting purposes.                      *
*                                                        *
**********************************************************/


#include <stdio.h>
#include <math.h>

#define arraysize 1000

double lg(), Q, Q1, int_A(), int_B(), rad;
main()
{
    int j, k;
    double i, frac, high, low, step, power, a[arraysize];
    double temp, total;
    char star, star1;
    FILE *fp;               /* fopen and fclose are declared in <stdio.h> */

    fp = fopen("outfE=", "w");
    star = ' ';
    star1 = 'o';

    printf("Enter radix, frac, total :\n");
    /* the targets are doubles, so %lf is required */
    scanf("%lf %lf %lf", &rad, &frac, &total);

    Q = pow(2., - frac);
    Q1 = Q;
    low = lg(1. - Q) - Q1 / 2.;
    high = lg(1. + Q) + Q1 / 2.;
    printf(" low = %f high = %f\n", low, high);
    step = (high - low) / total;

    j = 1;
    for (i = low; i <= (high + step); i += step){
        if ((i > lg(1. - Q) - Q1 / 2.) &&
            (i <= lg(1. - Q / 2.) - Q1 / 2.)){
            a[j] = (1. / Q1) * (int_A(i + Q1 / 2., lg(1. - Q)));
        }
        else if ((i > lg(1. - Q / 2.) - Q1 / 2.)
                 && (i <= lg(1. - Q) + Q1 / 2.)){
            a[j] = (1. / Q1) * (int_A(lg(1. - Q / 2.), lg
                   (1. - Q)) + int_B(i + Q1 / 2., lg(1. - Q / 2.)));
        }
        else if ((i > lg(1. - Q) + Q1 / 2.)
                 && (i <= lg(1. - Q / 2.) + Q1 / 2.)){
            a[j] = (1. / Q1) * (int_A(lg(1. - Q / 2.), i -
                   Q1 / 2.) + int_B(i + Q1 / 2., lg(1. - Q / 2.)));
        }
        else if ((i > lg(1. - Q / 2.) + Q1 / 2.)
                 && (i <= lg(1. + Q / 2.) - Q1 / 2.)){
            a[j] = (1. / Q1) *
                   int_B(i + Q1 / 2., i - Q1 / 2.);
        }
        else if ((i > lg(1. + Q / 2.) - Q1 / 2.)
                 && (i <= lg(1. + Q) - Q1 / 2.)){
            a[j] = (1. / Q1) * (int_B(lg(1. + Q / 2.), i -
                   Q1 / 2.) + int_A(i + Q1 / 2., lg(1. + Q / 2.)));
        }
        else if ((i > lg(1. + Q) - Q1 / 2.)
                 && (i <= lg(1. + Q / 2.) + Q1 / 2.)){
            a[j] = (1. / Q1) * (int_B(lg(1. + Q / 2.), i -
                   Q1 / 2.) + int_A(lg(1. + Q), lg(1. + Q / 2.)));
        }
        else if ((i > lg(1. + Q / 2.) + Q1 / 2.)
                 && (i <= lg(1. + Q) + Q1 / 2.)){
            a[j] = (1. / Q1) * int_A(lg(1. + Q), i - Q1 / 2.);
        }
        printf(" a[%d] at %f = %e\n", j, i, a[j]);
        fprintf(fp, "%d %e\n", j, a[j]);
        j++;
    }
    fclose(fp);

    /* scale the curve to at most 75 columns and print it;
       j++ replaces the undefined "j = j++" */
    temp = 0;
    for (j = 1; j <= total; j++)
        if (a[j] > temp)
            temp = a[j];
    if (temp >= 75.)
        for (j = 1; j <= total; j++)
            a[j] = 75. * (a[j] / temp);
    for (j = 1; j <= total; j++){
        printf(" %d", j);
        for (k = 1; k <= a[j]; k++)
            printf("%c", star);
        printf("%c\n", star1);
    }
}

/* base-rad logarithm */
double lg(arg)
double arg;
{
    return (log(arg) / log(rad));
}

double int_A(up, down)
double up, down;
{
    double temp;

    if (up > down) {
        temp = (Q / 4.) * (1. / (1. - pow(rad, up))
               - 1. / (1. - pow(rad, down)));
        temp -= (1. / (4. * Q)) * (pow(rad, up)
                - pow(rad, down));
    }
    else
        temp = 0.;
    return (temp);
}

double int_B(up, down)
double up, down;
{
    double temp;

    if (up > down) {
        temp = (3. / (4. * Q)) * (pow(rad, up)
               - pow(rad, down));
    }
    else
        temp = 0.;
    return (temp);
}


APPENDIX  E 

CODE  FOR  GENERATION  OF  THEORETICAL  CURVE  OF  THE 
p.d.f.  OF  THE  LOG  ENCODER  ERROR.  CASE  iii) 


/*********************************************************
*                                                        *
* This program computes the theoretical p.d.f.           *
* for the ERROR E at the output of the                   *
* FLP to LNS Encoder.                                    *
* Accepts the base (radix), Fractional wordlength (frac) *
* and the Total number of points to assume and computes  *
* the values for the p.d.f. of E for the second case     *
* (when Qo = Qi + 1)                                     *
* It assumes that there is an equal number of bits       *
* available at the input and output                      *
* of the mapping memory table.                           *
* It passes its output to the file 'outfE_sec'           *
* to be used for plotting purposes.                      *
*                                                        *
**********************************************************/


#include <stdio.h>
#include <math.h>

#define arraysize 1000

double lg(), Q, Q1, int_A(), int_B(), rad;
main()
{
    int j, k;
    double i, frac, high, low, step, power, a[arraysize];
    double temp, total;
    char star, star1;
    FILE *fp;               /* fopen and fclose are declared in <stdio.h> */

    fp = fopen("outfE_sec", "w");
    star = ' ';
    star1 = 'o';

    printf("Enter radix, frac, total :\n");
    /* the targets are doubles, so %lf is required */
    scanf("%lf %lf %lf", &rad, &frac, &total);

    Q = pow(2., - frac);
    Q1 = pow(2., - frac + 1.);
    low = lg(1. - Q) - Q1 / 2.;
    high = lg(1. + Q) + Q1 / 2.;
    step = (high - low) / total;

    j = 1;
    for (i = low; i <= (high + step); i += step){
        if ((i > lg(1. - Q) - Q1 / 2.) &&
            (i <= lg(1. - Q / 2.) - Q1 / 2.)){
            a[j] = (1. / Q1) *
                   (int_A(i + Q1 / 2., lg(1. - Q)));
        }
        else if ((i > lg(1. - Q / 2.) - Q1 / 2.)
                 && (i <= lg(1. - Q) + Q1 / 2.)){
            a[j] = (1. / Q1) * (int_A(lg(1. - Q / 2.), lg
                   (1. - Q)) + int_B(i + Q1 / 2., lg(1. - Q / 2.)));
        }
        else if ((i > lg(1. - Q) + Q1 / 2.)
                 && (i <= lg(1. + Q / 2.) - Q1 / 2.)){
            a[j] = (1. / Q1) * (int_A(lg(1. - Q / 2.), i -
                   Q1 / 2.) + int_B(i + Q1 / 2., lg(1. - Q / 2.)));
        }
        else if ((i > lg(1. + Q / 2.) - Q1 / 2.)
                 && (i <= lg(1. - Q / 2.) + Q1 / 2.)){
            a[j] = (1. / Q1) * (int_A(lg(1. - Q / 2.), i -
                   Q1 / 2.) + int_B(lg(1. + Q / 2.), lg(1. - Q /
                   2.)) + int_A(i + Q1 / 2., lg(1. + Q / 2.)));
        }
        else if ((i > lg(1. - Q / 2.) + Q1 / 2.)
                 && (i <= lg(1. + Q) - Q1 / 2.)){
            a[j] = (1. / Q1) * (int_B(lg(1. + Q / 2.), i -
                   Q1 / 2.) + int_A(i + Q1 / 2., lg(1. + Q / 2.)));
        }
        else if ((i > lg(1. + Q) - Q1 / 2.)
                 && (i <= lg(1. + Q / 2.) + Q1 / 2.)){
            a[j] = (1. / Q1) * (int_B(lg(1. + Q / 2.), i -
                   Q1 / 2.) + int_A(lg(1. + Q), lg(1. + Q / 2.)));
        }
        else if ((i > lg(1. + Q / 2.) + Q1 / 2.)
                 && (i <= lg(1. + Q) + Q1 / 2.)){
            a[j] = (1. / Q1) *
                   int_A(lg(1. + Q), i - Q1 / 2.);
        }
        printf(" a[%d] at %f = %e\n", j, i, a[j]);
        fprintf(fp, "%d %e\n", j, a[j]);
        j++;
    }
    fclose(fp);

    /* scale the curve to at most 75 columns and print it;
       j++ replaces the undefined "j = j++" */
    temp = 0;
    for (j = 1; j <= total; j++)
        if (a[j] > temp)
            temp = a[j];
    if (temp >= 75.)
        for (j = 1; j <= total; j++)
            a[j] = 75. * (a[j] / temp);
    for (j = 1; j <= total; j++){
        printf(" %d", j);
        for (k = 1; k <= a[j]; k++)
            printf("%c", star);
        printf("%c\n", star1);
    }
}

/* base-rad logarithm */
double lg(arg)
double arg;
{
    return (log(arg) / log(rad));
}

double int_A(up, down)
double up, down;
{
    double temp;

    if (up > down) {
        temp = (Q / 4.) * (1. / (1. - pow(rad, up))
               - 1. / (1. - pow(rad, down)));
        temp -= (1. / (4. * Q)) * (pow(rad, up)
                - pow(rad, down));
    }
    else
        temp = 0.;
    return (temp);
}

double int_B(up, down)
double up, down;
{
    double temp;

    if (up > down)
        temp = (3. / (4. * Q)) * (pow(rad, up)
               - pow(rad, down));
    else
        temp = 0.;
    return (temp);
}


APPENDIX  F 

CODE  FOR  GENERATION  OF  APPROXIMATION  CURVE  OF  THE 
p.d.f.  OF  THE  LOG  ENCODER  ERROR 


/*********************************************************
*                                                        *
* This program computes the Theoretical                  *
* Approximation of the theoretical p.d.f. for the        *
* ERROR E at the output of the FLP to LNS Encoder.       *
*                                                        *
* Accepts the base (radix), Fractional wordlength (frac) *
* and the Total number of points to assume and computes  *
* the values for the p.d.f. of E for the second case     *
* (when Qo = Qi + 1).                                    *
*                                                        *
* It assumes that there is an equal number of bits       *
* available at the input and output                      *
* of the mapping memory table.                           *
* It passes its output to the file 'outappr'             *
* to be used for plotting purposes.                      *
*                                                        *
**********************************************************/


#include <stdio.h>
#include <math.h>

#define arraysize 1000
main()
{
    int j, k;
    double i, rad, frac, high, low, power, a[arraysize];
    double step, temp, Q, Lamda, Kappa, x1K;
    double x2K, x1L, x2L, total, Q1, lnr;
    char star, star1;
    FILE *fp;               /* fopen and fclose are declared in <stdio.h> */

    fp = fopen("outappr", "w");
    star = ' ';
    star1 = 'o';

    printf("Enter radix, frac, total :\n");
    /* the targets are doubles, so %lf is required */
    scanf("%lf %lf %lf", &rad, &frac, &total);

    Q = pow(2., - frac);
    Q1 = pow(2., - (frac + 1.));
    lnr = log(rad);
    low = log(1. - Q) / lnr - Q1 / 2.;
    high = log(1. + Q) / lnr + Q1 / 2.;
    printf(" low = %f high = %f\n", low, high);

    Kappa = (3. * (1. - pow(rad, -Q1)) * (1. + Q / 2.)) /
            (4. * Q * Q1 * pow(rad, Q1 / 2.));
    Lamda = (3. * (1. - Q / 2.) *
            (pow(rad, Q1) - 1.)) / (4. * Q * Q1);
    x1L = low;
    x2L = log(1. - Q / 2.) / lnr + Q1 / 2.;
    x2K = log(1. + Q / 2.) / lnr - Q1 / 2.;
    x1K = high;
    step = (high - low) / total;

    j = 1;
    for (i = low; i <= (high + step); i += step){
        if ((i > low) &&
            (i <= log(1. - Q / 2.) / lnr + Q1 / 2.)){
            /* rising linear segment */
            a[j] = (i - x1L) * Lamda / (x2L - x1L);
        }
        else if ((i > log(1. - Q / 2.) / lnr + Q1 / 2.)
                 && (i <= log(1. + Q / 2.) / lnr - Q1 / 2.)){
            /* central part: exact expression */
            a[j] = (3. / (4. * Q * Q1)) *
                   (pow(rad, i + Q1 / 2.) - pow(rad, i - Q1 / 2.));
        }
        else if ((i > log(1. + Q / 2.) / lnr - Q1 / 2.)
                 && (i <= high)){
            /* falling linear segment */
            a[j] = (i - x1K) * Kappa / (x2K - x1K);
        }
        fprintf(fp, "%d %e\n", j, a[j]);
        j++;
    }
    fclose(fp);

    /* scale the curve to at most 75 columns and print it;
       j++ replaces the undefined "j = j++" */
    temp = 0;
    for (j = 1; j <= total; j++){
        if (a[j] > temp)
            temp = a[j];
    }
    if (temp >= 75.)
        for (j = 1; j <= total; j++)
            a[j] = 75. * (a[j] / temp);
    for (j = 1; j <= total; j++){
        printf("%d", j);
        for (k = 1; k <= a[j]; k++){
            printf("%c", star);
        }
        printf("%c\n", star1);
    }
}


APPENDIX  G 

CODE  FOR  GENERATION  OF  THE  ESSENTIAL  ZEROS  ENTRIES 

OF  TABLE  5.1 


/*********************************************************
 *
 * This program computes the Essential Zeros
 * for the addition and subtraction mappings
 * if the base and the number of fractional
 * bits of the LNS are known.
 *
 *********************************************************/


#include <stdio.h>
#include <math.h>

double M, radix;

int main()
{
    double v();
    int c;
    extern double M, radix;

    while ((c = getchar()) != EOF) {
        printf("enter radix r : , M :\n");
        scanf("%lf %lf", &radix, &M);
        printf("zero = %f\n", v());
        c = getchar();
    }
    return 0;
}


double v()  /* Computes the essential zero for
               given # of fractional bits M,
               and given radix r. */
{
    extern double M, radix;
    double V;

    V = - log(pow(radix, pow(2., - M)) - 1.) / log(radix);
    return (V);
}
APPENDIX H

CODE  FOR  GENERATION  OF  THEORETICAL  CURVE  OF  THE 
ERROR  p.d.f.  OF  THE  LOGARITHMIC  ADDER 


/*********************************************************
 *
 * This program computes the theoretical p.d.f.
 * for the ERROR E at the output of the
 * Logarithmic Adder for a specific address v.
 * Accepts the base (radix), Fractional wordlength (frac)
 * at the output of the memory table
 * Iex is the number of integer (only) bits that the LNS
 * exponent x is using.
 * v determines the table address
 * It passes its output to the file 'outdensity'
 * to be used for plotting purposes.
 *
 *********************************************************/


#include <math.h>
#include <stdio.h>

#define SIZE 1000

double radix, lnr, Q4, QL;

int main()
{
    extern double radix, lnr, Q4, QL;
    double temp, density[SIZE], a1, L, a2, b1, b2, c1, c2;
    double EL(), EL1(), EL2(), ef1(), ef2(), ef3();
    char star, star1;
    int eps, k;
    double Iex, suma, step, high, sumb, frac;
    int condition;
    double sum1, sum2, dd1, dd2, v, sum3, delta1, delta2;
    double i, sumc, j, Prob1, Prob2, Prob3, dd3, l, m;
    double Imean(), total, low, Zero, Ivar(), Prob(), sum;
    FILE *fp;

    fp = fopen("outdensity", "w");

    printf(" frac, radix, v, Iex, L, total\n");
    scanf("%lf %lf %lf %lf %lf %lf",
          &frac, &radix, &v, &Iex, &L, &total);
    printf("frac = %f, radix=%f, v=%f, Iex = %f, L = %f\n",
           frac, radix, v, Iex, L);

    lnr = log(radix);
    Q4 = pow(2., - frac);    /* In theory this is Qo */
    QL = pow(2., - (frac - Iex + L));
    printf("Q4 / 2 = %f, QL / 2 = %f\n", Q4 / 2., QL / 2.);

    star = ' ';
    star1 = 'o';

    a1 = EL(v) - EL2(v) - Q4 / 2.;
    c2 = EL(v) - EL1(v) + Q4 / 2.;
    if (Q4 <= (EL2(v) - EL1(v))) {
        a2 = EL(v) - EL2(v) + Q4 / 2.;
        b2 = EL(v) - EL1(v) - Q4 / 2.;
        condition = 1;
        printf("******** condition = 1 *******\n");
    }
    else if (Q4 > (EL2(v) - EL1(v))) {
        a2 = EL(v) - EL1(v) - Q4 / 2.;
        b2 = EL(v) - EL2(v) + Q4 / 2.;
        condition = 2;
        printf("******** condition = 2 *******\n");
    }
    b1 = a2;
    c1 = b2;
    low = a1;
    high = c2;
    step = (high - low) / total;
    eps = 1;

    for (i = low; i <= high; i += step) {
        if ((i >= a1) && (i < a2))
            density[eps] = ef1(v, i) * (1. / (Q4 * QL));
        else if ((i >= b1) && (i < b2) && (condition == 1))
            density[eps] = ef2(v, i) * (1. / (Q4 * QL));
        else if ((i >= b1) && (i < b2) && (condition == 2))
            density[eps] = 1. / Q4;
        else if ((i >= c1) && (i < c2))
            density[eps] = ef3(v, i) * (1. / (Q4 * QL));
        else {
            density[eps] = 0.;    /* outside the support of the p.d.f. */
            printf(" density[%d] at %f = %e\n",
                   eps, i, density[eps]);
        }
        if (density[eps] <= 0.0)
            density[eps] = 0.0;
        fprintf(fp, "%d\t%e\n", eps, density[eps]);
        eps++;
    }
    fclose(fp);

    temp = 0.;
    for (eps = 1; eps <= total; eps++)
        if (density[eps] > temp)
            temp = density[eps];
    /* Scale the curve so that it fits in about 70 characters */
    if (temp >= 70.) {
        temp = temp / 70.;
        for (eps = 1; eps <= total; eps++)
            density[eps] /= temp + 1.0;
    }
    for (eps = 1; eps <= total; eps++) {
        printf("%d", eps);
        for (k = 1; k <= (density[eps]); k++)
            printf("%c", star);
        printf("%c\n", star1);
    }
    return 0;
}

double ef1(double delta, double chi)
{
    extern double radix, lnr, QL;
    double Gamma1(), res;

    res = - delta + QL / 2. - log(Gamma1(delta) /
          pow(radix, chi) - 1.) / lnr;
    return(res);
}

double ef2(double delta, double chi)
{
    extern double radix, lnr;
    double Gamma1(), res, Gamma2();

    res = (log(Gamma2(delta) / pow(radix, chi) - 1.) -
           log(Gamma1(delta) / pow(radix, chi) - 1.)) / lnr;
    return(res);
}

double ef3(double delta, double chi)
{
    extern double radix, QL, lnr;
    double res, Gamma2();

    res = delta + QL / 2. + log(Gamma2(delta)
          / pow(radix, chi) - 1.) / lnr;
    return(res);
}

double Gamma1(double delta)
{
    extern double radix, Q4;
    double res;

    res = (1. + pow(radix, - delta)) *
          pow(radix, - Q4 / 2.);
    return(res);
}

double Gamma2(double delta)
{
    extern double radix, Q4;
    double res;

    res = (1. + pow(radix, - delta)) * pow(radix, Q4 / 2.);
    return(res);
}

double EL1(double delta)
{
    extern double radix, lnr, QL;
    double res;

    res = log(1. + pow(radix, - (delta + QL / 2.))) / lnr;
    return(res);
}

double EL2(double delta)
{
    extern double radix, lnr, QL;
    double res;

    res = log(1. + pow(radix, - (delta - QL / 2.))) / lnr;
    return(res);
}

double EL(double delta)
{
    extern double radix, lnr;
    double res;

    res = log(1. + pow(radix, - delta)) / lnr;
    return(res);
}


APPENDIX  I 

CODE  FOR  GENERATION  OF  APPROXIMATION  CURVE  OF  THE 
p.d.f.  OF  THE  ERROR  AT  THE  OUTPUT  OF  THE  LOGARITHMIC  ADDER 


/*********************************************************
 *
 * This program computes the approximated p.d.f.
 * for the ERROR E at the output of the
 * Logarithmic Adder for a specific address v.
 * Accepts the base (radix), Fractional wordlength (frac)
 * at the output of the memory table
 * Iex is the number of integer (only) bits that the LNS
 * exponent x is using.
 * v determines the table address
 * L is the number of shifts of the address performed
 * It passes its output to the file 'outdenappr'
 * to be used for plotting purposes.
 *
 *********************************************************/


#include <math.h>
#include <stdio.h>

#define SIZE 1000

double radix, lnr, Q4, QL;

/* EL(), EL1() and EL2() are not repeated in this listing;
   their definitions appear in Appendix H. */

int main()
{
    extern double radix, lnr, Q4, QL;
    double temp, density[SIZE], a1, L, a2, b1, b2, c1, c2;
    double top, EL(), EL1(), EL2();
    char star, star1;
    int eps, k;
    double Iex, Prob1, Prob2, Prob3, total, high, frac;
    int condition;
    double sum1, sum2, dd1, dd2, v, sum3, delta1, delta2;
    double i, suma, sumb, sumc, j, dd3, l, m;
    double Imean(), step, low, Zero, Ivar(), Prob(), sum;
    FILE *fp;

    fp = fopen("outdenappr", "w");

    printf(" frac, radix, v, Iex, L, total\n");
    scanf("%lf %lf %lf %lf %lf %lf",
          &frac, &radix, &v, &Iex, &L, &total);
    printf("frac=%f, radix=%f, v = %f, Iex = %f, L = %f\n",
           frac, radix, v, Iex, L);

    lnr = log(radix);
    Q4 = pow(2., - frac);
    QL = pow(2., - (frac - Iex + L));
    printf("Q4 / 2 = %f, QL / 2 = %f\n",
           Q4 / 2., QL / 2.);

    star = ' ';
    star1 = 'o';

    top = EL2(v) - EL1(v);
    a1 = EL(v) - EL2(v) - Q4 / 2.;
    c2 = EL(v) - EL1(v) + Q4 / 2.;
    if (Q4 <= top) {
        a2 = EL(v) - EL2(v) + Q4 / 2.;
        b2 = EL(v) - EL1(v) - Q4 / 2.;
        condition = 1;
        printf("******** condition = 1 *******\n");
    }
    else if (Q4 > top) {
        a2 = EL(v) - EL1(v) - Q4 / 2.;
        b2 = EL(v) - EL2(v) + Q4 / 2.;
        condition = 2;
        printf("******** condition = 2 *******\n");
    }
    b1 = a2;
    c1 = b2;
    low = a1;
    high = c2;
    step = (high - low) / total;

    printf("a1 = %f\ta2 = %f\n", a1, a2);
    printf("b1 = %f\tb2 = %f\n", b1, b2);
    printf("c1 = %f\tc2 = %f\n", c1, c2);

    eps = 1;
    if (condition == 1) {
        for (i = low; i <= high; i += step) {
            if ((i >= a1) && (i < a2)) {
                density[eps] = (i - a1) / (top * Q4);
                printf("ef11 at %d = %e\n", eps, density[eps]);
            }
            else if ((i >= b1) && (i < b2)) {
                density[eps] = 1. / top;
                printf("ef12 at %d = %e\n", eps, density[eps]);
            }
            else if ((i >= c1) && (i < c2)) {
                density[eps] = - (i - c2) / (top * Q4);
                printf("ef13 at %d = %e\n", eps, density[eps]);
            }
            else
                density[eps] = 0.;    /* outside the support */
            fprintf(fp, "%d\t%f\n", eps, density[eps]);
            eps++;
        }
    }
    else if (condition == 2) {
        for (i = low; i <= high; i += step) {
            if ((i >= a1) && (i < a2)) {
                density[eps] = (i - a1) / (Q4 * top);
                printf("ef21 at %d = %e\n", eps, density[eps]);
            }
            else if ((i >= b1) && (i < b2)) {
                density[eps] = 1. / Q4;
                printf("ef22 at %d = %e\n", eps, density[eps]);
            }
            else if ((i >= c1) && (i < c2)) {
                density[eps] = - (i - c2) / (Q4 * top);
                printf("ef23 at %d = %e\n", eps, density[eps]);
            }
            else
                density[eps] = 0.;    /* outside the support */
            fprintf(fp, "%d\t%f\n", eps, density[eps]);
            eps++;
        }
    }
    fclose(fp);

    temp = 0.;
    for (eps = 1; eps <= total; eps++)
        if (density[eps] > temp)
            temp = density[eps];
    if (temp >= 70.) {
        temp = temp / 70.;
        for (eps = 1; eps <= total; eps++)
            density[eps] /= temp + 1.0;
    }
    for (eps = 1; eps <= total; eps++) {
        printf("%d", eps);
        for (k = 1; k <= (density[eps]); k++)
            printf("%c", star);
        printf("%c\n", star1);
    }
    return 0;
}


APPENDIX  J 

SIMULATION  OF  A LOGARITHMIC  ADDER  DETERMINING 
AN  EXPERIMENTAL  ERROR  p.d.f. 


/*********************************************************
 *
 * This program : 'Adder_sim.c' performs a simulation
 * for the Error analysis at the output of the
 * Logarithmic Adder. Passes its output to the file
 * 'outadder' to be used for plotting.
 *
 * The parameters are:
 *
 * Total is the number of random #s (10000);
 * seed1 & seed2 are the random #s generator's
 * initializers. The final result is the 'Histogram'
 * value divided by (TOTAL * step) where
 * step = (high - low) / HISTO_AXIS,
 * (Used in 'histo' function).
 * frac are the fractional bits of the output and
 * Iex are the integer bits that ex is using.
 * v is the # for which the histogram is performed.
 *
 *********************************************************/


#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define SIZE 40000

int HISTO_AXIS;
double radix, lnr, Q4, QL;

/* nfrom() and histo() are not listed in this appendix;
   EL1() and EL2() are defined in Appendix H. */

int main()
{
    extern int HISTO_AXIS;
    extern double radix, lnr, Q4, QL;
    double EL1(), EL2();
    int total;
    int i, low, high, Ie, L, seed1, seed2;
    double temp, K, v, low1, frac, high1;
    double E4[SIZE], EL[SIZE], E[SIZE];
    char nam[20];
    short nfrom();
    int histo();

    printf("Enter HISTO_AXIS :\n");
    scanf("%d", &HISTO_AXIS);
    printf("Enter radix, total, frac, v, Ie, L :\n");
    scanf("%lf %d %lf %lf %d %d",
          &radix, &total, &frac, &v, &Ie, &L);
    printf("Enter seed1, seed2 :\n");
    scanf("%d %d", &seed1, &seed2);

    lnr = log(radix);
    K = log(1. + pow(radix, - v)) / lnr;
    printf(" radix = %f total = %d and frac = %f\n",
           radix, total, frac);
    printf("v = %f Ie = %d K = %f and L = %d\n",
           v, Ie, K, L);

    Q4 = pow(2., - frac);
    QL = pow(2., - (frac - Ie + L));
    printf("Q4 = %e QL = %e\n", Q4, QL);

    (void) srand(seed1);
    low = 0;
    high = 30000;
    for (i = 1; i <= total; i++)
        E4[i] = ((double)(nfrom(low, high)) *
                 (Q4 / high)) - (Q4 / 2.);

    (void) srand(seed2);
    for (i = 1; i <= total; i++)
        EL[i] = ((double)(nfrom(low, high)) *
                 (QL / high)) - (QL / 2.);

    /* Histogram for the Output Error */
    low1 = - Q4 / 2.;
    high1 = Q4 / 2.;
    histo(low1, high1, total, "E4", E4);

    /* Histogram for the Error EL */
    low1 = - QL / 2.;
    high1 = QL / 2.;
    histo(low1, high1, total, "EL", EL);

    for (i = 1; i <= total; i++)
        E[i] = K - log(1. + pow(radix, -
               (v + EL[i]))) / lnr - E4[i];

    /* Histogram for the final Error E */
    low1 = K - EL2(v) - Q4 / 2.;
    high1 = K - EL1(v) + Q4 / 2.;
    histo(low1, high1, total, "E", E);
    return 0;
}
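
The helper nfrom() is not listed in this appendix. The sketch below illustrates the same idea under the assumption that nfrom(low, high) returns a uniformly distributed integer in [low, high]. The names nfrom_sketch and uniform_error, and the use of rand() with a modulo, are illustrative and not taken from the original. The modulo introduces a slight bias that is acceptable for a sketch.

```c
#include <stdlib.h>

/* Hypothetical stand-in for nfrom(): an integer in [low, high]
   drawn with rand() (slightly biased because of the modulo). */
int nfrom_sketch(int low, int high)
{
    return low + rand() % (high - low + 1);
}

/* One sample of a quantization error spread over [-Q/2, Q/2],
   built the same way as the E4[] and EL[] entries above. */
double uniform_error(double Q, int high)
{
    return (double) nfrom_sketch(0, high) * (Q / high) - Q / 2.0;
}
```

Calling srand(seed) before a run of uniform_error(Q4, 30000) calls mirrors how the simulation seeds the two error sequences separately.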


APPENDIX  K 

CODE USED TO GENERATE THE ENTRIES OF TABLE 6.1


/*********************************************************
 *
 * Computation of approximation of
 * ERROR MEAN and VARIANCE
 * for various ranges of delta.
 * The computation is based on calculating the mean
 * and the variance of the output error
 * (due to quantization errors - and shifting -
 * at the input) for a specific delta, and then
 * averaging this variance by integrating the variance
 * over a range of delta's, and dividing the value of
 * the integral by the range length.
 *
 * frac are the fractional bits of the output and
 * deltac0 determines the "essential zero"
 * Iex are the integer bits that x is using.
 *
 * The experiment is performed for two and for
 * three levels of partitioning and no partitioning
 * at all.
 * program : Breakpoint.c
 *
 *********************************************************/


#include <math.h>
#include <stdio.h>

double lnr, Q4, QL, radix;

int main()
{
    double Iex, delta3, sum3, delta1, Prob1, Prob2, frac;
    double sum1, sum2, dd1, dd2, deltac0, delta2;
    double i, suma, sumb, sumc, j, Prob3, dd3, l, m;
    double Imean(), Zero, Ivar(), Prob(), sum;

    printf(" enter frac, radix, deltac0, Iex : \n");
    scanf("%lf %lf %lf %lf", &frac, &radix, &deltac0, &Iex);

    printf(" THREE-LEVEL EXPERIMENT for :\n");
    printf("frac = %f, radix = %f, deltac0 = %f, Iex = %f\n",
           frac, radix, deltac0, Iex);
    printf("************************************");
    printf("************************************\n");

    lnr = log(radix);
    Q4 = pow(2., - frac);
    /* In theory this is the precision */
    Zero = pow(2., deltac0);
    /* This is the essential zero */

    /*
       Calculation of the average variance starts here:
    */
    printf(" Break points\tvariance1\tvariance2\tvariance3");
    printf("\tTotal variance\n");
    i = 1.;
    j = 1.;
    delta3 = Q4;

    for (j = 1.; j <= deltac0 + frac - 2.; j++) {
        i = deltac0 - j;
        if (i < Iex - 2.) {
            delta1 = pow(2., i);
            dd1 = Zero - delta1;
            sum1 = 0.;
            sum2 = 0.;
            sum3 = 0.;
            Prob1 = Prob(Zero, Zero, delta1);
            QL = pow(2., - frac + Iex);    /* L = 0 */
            sum1 = Ivar(Zero, delta1) / dd1;
            sum1 *= Prob1;
            printf("%f", delta1);
            printf(" %e\n", sum1);

            /* Calculation for the second interval now. */
            for (m = 1.; m <= i + frac - 1.; m++) {
                if ((i + m) < Iex - 2.) {
                    l = i - m;
                    delta2 = pow(2., l);
                    dd2 = delta1 - delta2;
                    QL = pow(2., -(frac - Iex) - i);    /* i = L */
                    Prob2 = Prob(Zero, delta1, delta2);
                    sumb = (Imean(delta1, delta2) / dd2) * Prob2;
                    sum2 = Ivar(delta1, delta2) / dd2;
                    sum2 *= Prob2;
                    printf("%f", delta2);
                    printf(" %e\n", sum2);
                    dd3 = delta2 - delta3;
                    QL = pow(2., -(frac - Iex) - (i + m));
                    /* i+m = L */
                    Prob3 = Prob(Zero, delta2, delta3);
                    sumc = (Imean(delta2, delta3) / dd3) * Prob3;
                    sum3 = Ivar(delta2, delta3) / dd3;
                    sum3 *= Prob3;
                    printf("%e %e\n",
                           sum3, sum1 + sum2 + sum3);
                }
                else
                    printf("\n");
            }
        }
        else
            printf("\n");
    }

    printf(" TWO LEVEL EXPERIMENT for:");
    printf("\nfrac = %f\tradix=%f deltac0=%f Iex = %f\n",
           frac, radix, deltac0, Iex);
    printf("************************************");
    printf("************************************\n");

    /*
       Calculation of the average variance starts here:
    */
    printf("Breakpoints\tvariance1\tvariance2");
    printf("\tTotal variance\n");

    i = 1.;
    j = 1.;
    delta3 = Q4;

    for (j = 1.; j <= deltac0 + frac - 1.; j++) {
        i = deltac0 - j;
        if (i < Iex - 2.) {
            delta2 = pow(2., i);
            dd1 = Zero - delta2;
            dd2 = delta2 - delta3;
            sum1 = 0.;
            sum2 = 0.;
            QL = pow(2., - frac + Iex);    /* L = 0 */
            Prob1 = Prob(Zero, Zero, delta2);
            sum1 = (Ivar(Zero, delta2) / dd1);
            sum1 *= Prob1;
            printf("%f %e \n", delta2, sum1);

            /* Calculation for the second interval now. */
            Prob2 = Prob(Zero, delta2, delta3);
            QL = pow(2., -(frac - Iex) - i);
            sumb = (Imean(delta2, delta3) / dd2) * Prob2;
            sum2 = Ivar(delta2, delta3) / dd2;
            sum2 *= Prob2;
            printf("%f %e %e\n",
                   delta2, sum2, sum1 + sum2);
        }
        else
            printf("\n");
    }

    printf(" NO BREAKPOINT EXPERIMENT for:");
    printf(" frac=%f radix=%f deltac0=%f Iex=%f\n",
           frac, radix, deltac0, Iex);
    printf("************************************");
    printf("************************************\n");

    /*
       Calculation of the average variance starts here
    */
    delta3 = Q4;
    QL = pow(2., - frac + Iex);    /* L = 0 */
    sum = 0.;
    suma = 0.;
    dd1 = Zero - delta3;
    sum = (Ivar(Zero, delta3) / dd1);
    suma = (Imean(Zero, delta3) / dd1);
    printf("Average TOTAL variance = %e\n", sum);
    printf("Average TOTAL mean = %e\n", suma);
    printf("Breakpoints\tAverage variance");
    printf("\tComprehensive variance\n");
    sum = 0.;

    for (j = 0.; j <= deltac0 + frac; j++) {
        i = deltac0 - j;
        delta1 = pow(2., i);
        delta2 = pow(2., i - 1.);
        dd1 = delta1 - delta2;
        sum1 = 0.;
        Prob1 = Prob(Zero, delta1, delta2);
        sum1 = (Ivar(delta1, delta2) / dd1);
        sum1 *= Prob1;
        sum += sum1;
        printf(" %f %f\t%e\t%e\n",
               delta1, delta2, sum1, sum);
    }
    return 0;
}


/* Integral of mean() over [x1, x2] by the three-point Simpson rule */
double Imean(double x2, double x1)
{
    double res, mean();

    res = ((x2 - x1) / 6.) * (mean(x2)
          + 4. * mean((x1 + x2) / 2.) + mean(x1));
    return(res);
}

/* Integral of var() over [x1, x2] by the three-point Simpson rule */
double Ivar(double x2, double x1)
{
    double res, var();

    res = ((x2 - x1) / 6.) *
          (var(x2) + 4. * var((x1 + x2) / 2.) + var(x1));
    return(res);
}
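
Imean() and Ivar() share one pattern: a single application of the three-point Simpson rule, (x2 - x1)/6 * (f(x2) + 4 f((x1 + x2)/2) + f(x1)). The following generic sketch is not part of the original listing. The name simpson3, the function-pointer parameter, and the test integrand cube are introduced here only to illustrate the rule.

```c
/* Three-point Simpson approximation of the integral of f over [x1, x2]. */
double simpson3(double (*f)(double), double x1, double x2)
{
    return ((x2 - x1) / 6.0) *
           (f(x2) + 4.0 * f((x1 + x2) / 2.0) + f(x1));
}

/* Illustrative integrand: the rule is exact for polynomials
   of degree three or less. */
double cube(double x)
{
    return x * x * x;
}
```

With these definitions, simpson3(cube, 0.0, 2.0) reproduces the exact integral of x^3 over [0, 2]. For the smoother but non-polynomial mean() and var() the rule is only an approximation, and its accuracy depends on the interval width.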

double var(double delta)
{
    extern double radix, lnr, Q4, QL;
    double top, res, a1, a2, b1, b2, c1, c2;
    double EL(), EL1(), EL2(), mean(), ef1(), ef2(), ef3();
    int condition;

    top = EL2(delta) - EL1(delta);
    a1 = EL(delta) - EL2(delta) - Q4 / 2.;
    c2 = EL(delta) - EL1(delta) + Q4 / 2.;
    if (Q4 <= top) {
        a2 = EL(delta) - EL2(delta) + Q4 / 2.;
        b2 = EL(delta) - EL1(delta) - Q4 / 2.;
        condition = 1;
    }
    else if (Q4 > top) {
        a2 = EL(delta) - EL1(delta) - Q4 / 2.;
        b2 = EL(delta) - EL2(delta) + Q4 / 2.;
        condition = 2;
    }
    b1 = a2;
    c1 = b2;

    if (condition == 1) {
        res = ((pow(a2, 4.) - pow(a1, 4.)) / 4. +
               (pow(a1, 4.) - pow(a2, 3.) * a1) / 3.) / Q4;
        res += (pow(b2, 3.) - pow(b1, 3.)) / 3.;
        res -= ((pow(c2, 4.) - pow(c1, 4.)) / 4. -
                (pow(c2, 4.) - pow(c1, 3.) * c2) / 3.) / Q4;
        res /= top;
        res -= pow(mean(delta), 2.);
    }
    else if (condition == 2) {
        res = ((pow(a2, 4.) - pow(a1, 4.)) / 4. +
               (pow(a1, 4.) - pow(a2, 3.) * a1) / 3.) / (Q4 *
               (EL2(delta) - EL1(delta)));
        res += (pow(b2, 3.) - pow(b1, 3.)) / (3. * Q4);
        res -= ((pow(c2, 4.) - pow(c1, 4.)) / 4. -
                (pow(c2, 4.) - pow(c1, 3.) * c2) / 3.)
               / (Q4 * (EL2(delta) - EL1(delta)));
        res -= pow(mean(delta), 2.);
    }
    return(res);
}

double mean(double delta)
{
    extern double radix, lnr, Q4, QL;
    double top, res, a1, a2, b1, b2, c1, c2;
    double EL(), EL1(), EL2();
    int condition;

    top = EL2(delta) - EL1(delta);
    a1 = EL(delta) - EL2(delta) - Q4 / 2.;
    c2 = EL(delta) - EL1(delta) + Q4 / 2.;
    if (Q4 <= top) {
        a2 = EL(delta) - EL2(delta) + Q4 / 2.;
        b2 = EL(delta) - EL1(delta) - Q4 / 2.;
        condition = 1;
    }
    else if (Q4 > top) {
        a2 = EL(delta) - EL1(delta) - Q4 / 2.;
        b2 = EL(delta) - EL2(delta) + Q4 / 2.;
        condition = 2;
    }
    b1 = a2;
    c1 = b2;

    if (condition == 1) {
        res = ((pow(a2, 3.) - pow(a1, 3.)) /
               3. + (pow(a1, 3.) - pow(a2, 2.) * a1) / 2.) / Q4;
        res += (pow(b2, 2.) - pow(b1, 2.)) / 2.;
        res -= ((pow(c2, 3.) - pow(c1, 3.)) / 3. -
                (pow(c2, 3.) - pow(c1, 2.) * c2) / 2.) / Q4;
        res /= top;
    }
    else if (condition == 2) {
        res = ((pow(a2, 3.) - pow(a1, 3.)) / 3. + (pow
               (a1, 3.) - pow(a2, 2.) * a1) / 2.) / (Q4 * top);
        res += (pow(b2, 2.) - pow(b1, 2.)) / (2. * Q4);
        res -= ((pow(c2, 3.) - pow(c1, 3.)) / 3. - (pow(c2,
                3.) - pow(c1, 2.) * c2) / 2.) / (Q4 * top);
    }
    return(res);
}

double Prob(double zero, double w2, double w1)
{
    double res;

    res = (2. / zero) * ((w2 - w1) - (pow(w2, 2.) -
          pow(w1, 2.)) / (2. * zero));
    return(res);
}
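
Algebraically, the expression in Prob() equals the integral of the linearly decreasing density (2/zero)(1 - w/zero) from w1 to w2, so the probabilities of adjacent intervals covering [0, zero] should add up to one. The sketch below checks this. The names prob_sketch and prob_total are hypothetical, and prob_sketch simply repeats the Prob() formula so that the fragment stands alone.

```c
/* Copy of the Prob() formula, repeated for a self-contained check. */
double prob_sketch(double zero, double w2, double w1)
{
    return (2.0 / zero) *
           ((w2 - w1) - (w2 * w2 - w1 * w1) / (2.0 * zero));
}

/* Probability of [0, b] plus probability of [b, zero]. */
double prob_total(double zero, double b)
{
    return prob_sketch(zero, b, 0.0) + prob_sketch(zero, zero, b);
}
```

For instance, with zero = 16 and a breakpoint at b = 4 the two intervals carry probabilities 0.4375 and 0.5625.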


APPENDIX  L 

CODE  FOR  LATENCY  OPTIMIZATION  OF  AMP  LNS 


/*********************************************************
 *
 * This program : 'tree.c' provides for a study of the
 * associative memory LNS processor in a tree fashion.
 * For each of the #s between Low and High, generates a
 * binary representation without including the decimal
 * point. It then forms a tree to generate the output
 * from the input and gives information about the total
 * number of switches and leaf nodes of the tree.
 * Output to 'screen'.
 * The parameters are:
 * base : rad, Input precision : prec defined by frac1,
 * Output precision : prec2 defined by frac2,
 * Low and High define the range of the v variable.
 * N (= K for a complete tree) specifies the # of bits
 * for the final binary input representation,
 * and shift2 for the binary Output (mapping).
 *
 *********************************************************/


#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define SIZE 20
#define BIG 65538

int old_s, n1L, shift2, n1H, Rlow;
int K, N, w[SIZE], n[SIZE], s[SIZE], index, Rhigh;
int lim, index, Total_switches, final_nodes, count;
double frac1, frac2, Nl[BIG], Nh[BIG], low, high;
double round(), rad, prec, prec2, logr(), v;
double highvm[BIG], lowvm[BIG], highvt[BIG], lowvt[BIG];

int main()
{
    int map, L[SIZE], produce();

    printf("Enter frac1, frac2, low, high, rad \n");
    scanf("%lf %lf %lf %lf %lf", &frac1, &frac2, &low, &high, &rad);

    prec = pow(2., - frac1);
    prec2 = pow(2., - frac2);

    if (round(logr(round(low, prec)), prec2) >= 0.99999)
        shift2 = frac2 + 1;
    else
        shift2 = frac2;
    if (high >= 1.)
        N = frac1 + floor(log(floor(high)) / log(2.)) + 1;
    else
        N = floor(log(pow(prec, -1.) * high) / log(2.)) + 1;

    lim = (BIG < (int) pow(2., (double) shift2)) ?
          BIG : pow(2., (double) shift2);

    printf("\nN = %d\tF = %d\n", N, shift2);
    printf("Enter K=N ?, Rlow, Rhigh=1 ?\tn1L=?\tn1H = ?\n");
    scanf("%d %d %d %d %d", &K, &Rlow, &Rhigh, &n1L, &n1H);
    printf("\nN = %d\tK = %d\tRlow = %d\tRhigh = %d\n",
           N, K, Rlow, Rhigh);
    printf(" frac1 = %.1f\tfrac2 = %.1f\tlim = %d\n",
           frac1, frac2, lim);
    printf(" Input precision = %g\tOutput precision = %g\n",
           prec, prec2);
    printf("low = %f\thigh = %f\trad = %.1f\n",
           low, high, rad);
    printf("n1L = %d\tn1H = %d\n", n1L, n1H);

    for (index = 0; index < lim; index++) {
        Nl[index] = 0;
        Nh[index] = 0;
        lowvm[index] = 0;
        highvm[index] = 0;
        lowvt[index] = 0;
        highvt[index] = 0;
    }
    for (index = 0; index < SIZE; index++) {
        L[index] = 0;
        n[index] = 0;
        s[index] = 0;
        w[index] = 0;
    }

    L[1] = N - K + 1;
    if (K == 1) {
        n[1] = N;
        printf(" n[1] = %d\n", n[1]);
    }
    else
        for (index = 1; index <= K; index++)
            produce(L, index);
    return 0;
}

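/* Addition mapping: log_rad(1 + rad^(-arg)) */
double logr(double arg)
{
    return(log(1. + pow(rad, - arg)) / log(rad));
}

/* Rounds arg to the nearest multiple of Q.
   Note: this two-argument round() clashes with the round()
   declared by a C99 <math.h>; rename it when compiling with
   a modern compiler. */
double round(double arg, double Q)
{
    return(floor(0.5 + arg / Q) * Q);
}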

int produce(int L[SIZE], int index)
{
    int nodes, j, condition, Total_nodes, m, flag;
    int sumn, Parts, t, intv1, intv2, intmap1, intmap2;
    double v1, v2, map1, map2, ln;

    for (w[index] = 1; w[index] <= L[index]; w[index]++) {
        n[index] = w[index];
        L[index+1] = L[index] - w[index] + 1;
        if (index < K - 1) {
            produce(L, index + 1);
        }
        else {
            n[index + 1] = L[index + 1];
            condition = 1;
            for (j = 1; j <= K; j++)
                if ((n[j] > Rhigh) || (n[j] < Rlow)
                    || (n[1] < n1L) || (n[1] > n1H))
                    condition = 0;

            if (condition == 1) {
                Total_switches = 0;
                Total_nodes = 0;
                old_s = 1;
                sumn = 0;
                Nl[1] = low;
                Nh[1] = high;

                for (j = 1; j <= K; j++) {
                    for (m = 1; m <= old_s; m++) {
                        lowvm[m] = Nl[m];
                        highvm[m] = Nh[m];
                    } /* for m */
                    count = 0;
                    sumn += n[j];
                    final_nodes = 0;

                    for (m = 1; m <= old_s; m++) {
                        Parts = ceil((highvm[m] - lowvm[m]) /
                                (pow(2., (double) (N - sumn)) * prec));
                        flag = 0;

                        for (t = 1; t <= Parts; t++) {
                            lowvt[t] = lowvm[m] + (t - 1) * prec
                                       * pow(2., (double) (N - sumn));
                            highvt[t] = lowvt[t] + prec *
                                        (pow(2., (double) (N - sumn)) - 1);

                            if (lowvt[t] < high) {
                                for (v = lowvt[t]; v <= highvt[t]
                                     - prec; v += prec) {
                                    v1 = round(v, prec);
                                    v2 = round(v + prec, prec);
                                    intv1 = pow(prec, -1.) * v1;
                                    intv2 = pow(prec, -1.) * v2;
                                    intv1 <<= (32 - sumn);
                                    intv2 <<= (32 - sumn);
                                    map1 = round(logr(v1), prec2);
                                    map2 = round(logr(v2), prec2);
                                    intmap1 = pow(prec2, -1.) * map1;
                                    intmap2 = pow(prec2, -1.) * map2;
                                    intmap1 <<= (32 - shift2);
                                    intmap2 <<= (32 - shift2);

                                    if ((intmap1 ^ intmap2) != 0) {
                                        count++;
                                        if (count >= lim) {
                                            printf("count = %d > lim\n",
                                                   count);
                                            exit(1);
                                        }
                                        Nl[count] = lowvt[t];
                                        Nh[count] = highvt[t];
                                        break;
                                    } /* if */
                                    else if ((v2 ==
                                              round(highvt[t], prec)) &&
                                             (flag != 1)) {
                                        final_nodes++;
                                        flag = 1;
                                    } /* else if */
                                } /* for v */
                            } /* if lowvt */
                        } /* for t */
                    } /* for m */

                    old_s = count;
                    Total_switches += count;
                    Total_nodes += final_nodes;
                    if (j == K - 1)
                        nodes = old_s;
                    if (old_s == 0) {
                        s[j] = 0;
                        printf("n[%d]=%d\ts[%d]=%d\tswitches=%d\t",
                               j, n[j], j, s[j], old_s);
                        printf("# of final nodes = %d\n",
                               final_nodes);
                        break;
                    } /* if old_s */
                    else {
                        ln = log((double) old_s) / log(2.);
                        if (old_s == 1)
                            s[j] = 1;
                        else if ((ln - floor(ln)) <=
                                 0.0000000000000001)
                            s[j] = floor(ln);
                        else if (ln - floor(ln) > 0.)
                            s[j] = floor(ln) + 1;
                    } /* else */
                    printf("n[%d]=%d\ts[%d]=%d\tswitches=%d\t",
                           j, n[j], j, s[j], old_s);
                    printf("# of final nodes = %d\n",
                           final_nodes);
                } /* for j */

                printf("\n");
                printf("Total # of switches = %d\n",
                       Total_switches);
                printf("Total # of nodes = %d\n",
                       Total_nodes + nodes);
                if ((n[1] == N - K + 1) || (n[1] == Rhigh))
                    exit(0);
            } /* if condition */
        } /* else index */
    } /* for w[] */
    return 0;
} /* produce() */

APPENDIX  M 

CODE  USED  TO  GENERATE  THE  ENTRIES  OF  TABLE  7.3 


/*********************************************************
 *
 * This program : 'am_proc.c' provides for a study of the
 * associative memory LNS processor. For each of the #s
 * between Low and High, generates a binary representation
 * without including the decimal point. It then
 * partitions the representation array of N elements into
 * K adjoint subarrays and forms the N[i] and S[i]
 * parameters for the associative memory Logarithmic
 * processor study.
 *
 * Output to 'screen'.
 *
 * The parameters are:
 *
 * radix : rad, Input precision : prec defined by frac1,
 * Output precision : prec2 defined by frac2,
 * Low and High define the range of the v variable.
 * N specifies the # of bits for the final
 * binary input representation, and
 * shift2 for the Output (mapping) representation
 *
 *********************************************************/

#include <math.h>
#include <stdio.h>

#define SIZE 20
#define BIG 50000

int K, N, w[SIZE], n[SIZE], s[SIZE], index, Rhigh;
int index, shift2, old_s, Rlow, count;
double frac1, frac2, Nl[BIG], Nh[BIG], low, high;
double round(), rad, prec, prec2, logr(), v;
double highvm[BIG], lowvm[BIG], highvt[BIG], lowvt[BIG];

int main()
{
    int L[SIZE], map, produce();

    printf("Enter frac1, frac2, low, high, rad \n");




scanf("%f  %f  %f  %f  %f " , &f racl , &f rac2 , &low,  shigh,  &rad); 

prec  = pow(2.,  - fracl); 
prec2  = pow(2.,  - frac2); 

map  = round ( logr ( round ( low,  prec)),  prec2); 
map  *=  pow(prec2,  -1.); 

i f ( map  >=  1 . ) 

shift2  = frac2  + 1; 
else 

shift2  = frac2; 
if  (high  >=  1 . ) 

N = fracl  + f loor ( log ( floor ( high ) ) / log(2.))  + 1; 
else 

N = floor ( log( pow( prec , -1.)  * high)  / log(2.))  + 1; 

printf("N  - %dF  - %d0,  N,  shift2); 
printf ( "Enter  K,  Rlow,  RhighO); 
scanf("%d  %d  %d",  &K,  &Rlow,  &Rhigh); 
printf ( "N  * %dK  = %dRlow  = %dRhigh  = %d0, 

N,K,  Rlow,  Rhigh) ; 

printf ( "fracl  = %.lffrac2  = %.lfO,  fracl,  frac2); 
printf (" Input  precision  = %g  Output  precision  - %g0, 
prec , prec2 ) ; 

printf("low  = %fhigh  = %frad  = %.lfO, 
low,  high,  rad ) ; 

for  (index  - 0;  index  < BIG;  index++){ 

Nl[index]  - 0; 

Nh[ index]  - 0; 
lowvmf index]  = 0; 
highvm[ index ] = 0; 

} 

for  (index  = 0;  index  < SIZE;  index++){ 

L[ index]  = 0; 
nfindex]  = 0; 
s[ index]  = 0; 
w[ index]  = 0; 

} 

L[l]  =N—  K+l; 

if  ( K ==  1 ) { 
n [ 1 ] = N; 

printf("  n[l]  = %d0,  n[l]); 

} 

else 

for  (index  = 1;  index  <=  K;  index++) 

produce(L,  index); 


} 
double logr(arg)
double arg;
{
    return (log(1. + pow(rad, -arg)) / log(rad));
}

double round(arg, Q)
double arg, Q;
{
    double res;

    res = floor(0.5 + arg / Q) * Q;
    return (res);
}

produce(L, index)
int L[SIZE];
{
    int j, condition, m;
    int sumn, Parts, t, intv1, intv2, intmap1, intmap2;
    double v1, v2, map1, map2, ln;

    for (w[index] = 1; w[index] <= L[index]; w[index]++) {
        n[index] = w[index];
        L[index + 1] = L[index] - w[index] + 1;
        if (index < K - 1) {
            produce(L, index + 1);
        }
        else {
            n[index + 1] = L[index + 1];
            condition = 1;
            for (j = 1; j <= K; j++)
                if ((n[j] > Rhigh) || (n[j] < Rlow))
                    condition = 0;
            if (condition == 1) {
                old_s = 1;
                sumn = 0;
                Nl[1] = low;
                Nh[1] = high;
                for (j = 1; j <= K; j++) {
                    for (m = 1; m <= old_s; m++) {
                        lowvm[m] = Nl[m];
                        highvm[m] = Nh[m];
                    }
                    count = 0;
                    sumn += n[j];
                    /*
                    printf("sumn = %d\n", sumn);
                    */
                    for (m = 1; m <= old_s; m++) {
                        Parts = ceil((highvm[m] - lowvm[m]) /
                                     (pow(2., (double) (N - sumn)) * prec));
                        /*
                        printf("Parts = %d\n", Parts);
                        */
                        for (t = 1; t <= Parts; t++) {
                            lowvt[t] = lowvm[m] + (t - 1) * prec
                                       * pow(2., (double) (N - sumn));
                            highvt[t] = lowvt[t] + prec *
                                        (pow(2., (double) (N - sumn)) - 1);
                            if (lowvt[t] < high) {
                                for (v = lowvt[t];
                                     v <= highvt[t] - prec; v += prec) {
                                    v1 = round(v, prec);
                                    v2 = round(v + prec, prec);
                                    intv1 = pow(prec, -1.) * v1;
                                    intv2 = pow(prec, -1.) * v2;
                                    intv1 <<= (32 - sumn);
                                    intv2 <<= (32 - sumn);
                                    map1 = round(logr(v1), prec2);
                                    map2 = round(logr(v2), prec2);
                                    intmap1 = pow(prec2, -1.) * map1;
                                    intmap2 = pow(prec2, -1.) * map2;
                                    intmap1 <<= (32 - shift2);
                                    intmap2 <<= (32 - shift2);
                                    /* neighbouring addresses map to different outputs */
                                    if ((intmap1 ^ intmap2) != 0) {
                                        count++;
                                        if (count >= BIG)
                                            printf("count=%d > %d\n",
                                                   count, BIG);
                                        Nl[count] = lowvt[t];
                                        Nh[count] = highvt[t];
                                        break;
                                    }
                                }
                            }
                        }
                    }
                    old_s = count;
                    if (old_s == 0) {
                        s[j] = 0;
                        printf("n[%d] = %d s[%d] = %d states = %d\n",
                               j, n[j], j, s[j], old_s);
                        break;
                    }
                    else {
                        ln = log((double) old_s) / log(2.);
                        if (old_s == 1)
                            s[j] = 1;
                        else if ((ln - floor(ln))
                                 <= 0.0000000000000001)
                            s[j] = floor(ln);
                        else if (ln - floor(ln) > 0.)
                            s[j] = floor(ln) + 1;
                        printf("n[%d] = %d s[%d] = %d states = %d\n",
                               j, n[j], j, s[j], old_s);
                    }
                }
                printf("\n");
                if (n[1] == N - K + 1)
                    exit(0);
            }
        }
    }
}
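
The recursion in produce() walks every ordered split of the N address bits into K subarray widths n[1..K], keeps those whose widths lie between Rlow and Rhigh, and then derives a state-bit count s[j] from the number of surviving intervals. The sketch below isolates those two ideas in ANSI C. The function names `count_partitions` and `state_bits` are hypothetical and the code only counts the admissible splits; it does not reproduce the interval bookkeeping of the listing.

```c
#include <assert.h>

/* Count the ordered ways to write n as a sum of k parts,
   each part between lo and hi inclusive. */
long count_partitions(int n, int k, int lo, int hi)
{
    long total = 0;
    int part;

    assert(k >= 1 && lo >= 1 && lo <= hi);
    if (k == 1)
        return (n >= lo && n <= hi) ? 1 : 0;
    /* leave at least lo for each of the remaining k - 1 parts */
    for (part = lo; part <= hi && part <= n - (k - 1) * lo; part++)
        total += count_partitions(n - part, k - 1, lo, hi);
    return total;
}

/* Bits needed to label `states` distinct states. Mirrors the listing's
   convention: 0 states need 0 bits, a single state still takes 1 bit,
   otherwise ceil(log2(states)). */
int state_bits(int states)
{
    int bits = 0;

    assert(states >= 0);
    if (states == 0)
        return 0;
    if (states == 1)
        return 1;
    while ((1L << bits) < states)
        bits++;
    return bits;
}
```

Using an integer loop for the bit count avoids the floating-point tolerance test on log(old_s) / log(2.) that the listing relies on.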


APPENDIX  N 

PROGRAM  FOR  SIMULATION  OF  THE  LNS  CORDIC  PROCESSOR 


/***********************************************************
*                                                          *
* This program performs a simulation for the               *
* conventional CORDIC trigonometric processor              *
* and the equivalent LNS trigonometric engine.             *
* The LNS results are also compared to a FLP               *
* implementation with double precision (64 bits)           *
* as well as to a FXP implementation.                      *
*                                                          *
* The parameters involved are :                            *
* frac, Qcor, Qlog : # of bits for FXP, CORDIC, and        *
*    LNS realization respectively.                         *
* step : determines the number of intervals in which       *
*    the 90° interval is split, for computation            *
*    of angles and radii.                                  *
* Cordic_Iter : gives the # of CORDIC iterations           *
*    necessary to achieve maximum possible precision.      *
* const : K of Cordic equations ((9.4) or (9.5))           *
*                                                          *
* Output offers the square of the error of all three       *
* realizations when compared to the FLP one.               *
*                                                          *
***********************************************************/


#include <math.h>
#include <stdio.h>
#define size 1001
#define SIZE 1001
#define COEFF 10
#define pe 3.141592654
double rad, lowexp;
int Underflow;

main()
{
    double lx[SIZE], el3, el5, el7;
    double div, logadd(), step, logmul();
    double QT, logang[SIZE], gonia, sigma;
    double TFLP, TFXP, Tlog, TCOR, FXPang[SIZE], FXPrad[SIZE];
    double TFLPrad, TFXPrad, corrad[SIZE], TCORrad;
    double Cordic_Iter, Tlograd, const;
    double trunc(), Qcor, Ycord[SIZE];
    double ATP[SIZE], angle[SIZE], Xcord[SIZE];
    double lang[SIZE], temp, lrad[SIZE], corangle[SIZE];
    double flpang[SIZE], Qlog, flprad[SIZE];
    double cartx[SIZE], ly[SIZE], carty[SIZE];
    double ICC(), frac, Int, Quant, Quantl;
    double round(), el9, elpe2, FCC();
    int j, i, signadd(), sign(), sx[SIZE], signmul();
    int sang[SIZE], sdiv, stemp, sy[SIZE], srad[SIZE];

    printf(" Enter frac, step \n");
    printf(" Enter Cordic_Iter, const, Qcor, Qlog : \n");
    /* all six targets are doubles, so %lf is required */
    scanf("%lf%lf%lf%lf%lf%lf",
          &frac, &step, &Cordic_Iter, &const, &Qcor, &Qlog);
    Quant = pow(2.0, frac);
    QT = pow(2.0, Qcor);
    Quantl = pow(2.0, Qlog);
    rad = 2.0;
    Int = 4.0;
    el3 = round(FCC(3.), Quantl);
    el5 = round(FCC(5.), Quantl);
    el7 = round(FCC(7.), Quantl);
    el9 = round(FCC(9.), Quantl);
    elpe2 = round(FCC(pe / 2.), Quantl);
    lowexp = -pow(2., Int);
    printf("lowexp = %f\n", lowexp);
    for (i = 3; i <= 19; i++) {  /* CORDIC Coefficients */
        ATP[i] = atan(pow(2., -(i - 3.))) * 180. / pe;
    }
    Underflow = 0;
    printf("i radius angle sin cosine\n");

    for (i = 0; i < (90. / step); i++) {
        cartx[i] = cos(pe * i * step / 180.);
        carty[i] = sin(pe * i * step / 180.);
        flprad[i] = hypot(cartx[i], carty[i]);
        flpang[i] = atan2(carty[i], cartx[i]) * 180. / pe;
        if (round(cartx[i], Quant) != 0.0)
            sigma = round(round(carty[i], Quant)
                          / round(cartx[i], Quant), Quant);
        else
            sigma = round(carty[i] / cartx[i], Quant);
        lx[i] = round(FCC(cartx[i]), Quantl);
        sx[i] = sign(cartx[i]);
        ly[i] = round(FCC(carty[i]), Quantl);
        sy[i] = sign(carty[i]);
        FXPrad[i] = round(sqrt(round
                    (pow(round(carty[i], Quant), 2.), Quant) +
                    round(pow(round(cartx[i], Quant), 2.),
                    Quant)), Quant);
        lrad[i] = logadd(2. * lx[i], sx[i], 2. *
                         ly[i], sy[i]) / 2.;
        srad[i] = signadd(2. * lx[i], sx[i], 2. *
                          ly[i], sy[i]);
        lrad[i] = ICC(lrad[i], srad[i]);
        div = ly[i] - lx[i];
        sdiv = sy[i] * sx[i];

        if (pow(sigma, 2.) < 1.) {
            FXPang[i] = round(round(round((round(sigma, Quant)
                - round(round(pow(sigma, 3.), Quant) / 3., Quant)
                + round(round(pow(sigma, 5.), Quant) / 5., Quant)
                - round(round(pow(sigma, 7.), Quant) / 7., Quant)
                + round(round(pow(sigma, 9.), Quant) / 9., Quant)
                ), Quant) * 180., Quant) / pe, Quant);
            lang[i] = logadd(div, sdiv, 3. * div - el3, -sdiv);
            sang[i] = signadd(div, sdiv, 3.
                              * div - el3, -sdiv);
            temp = logadd(lang[i], sang[i], 5.
                          * div - el5, sdiv);
            stemp = signadd(lang[i], sang[i], 5.
                            * div - el5, sdiv);
            lang[i] = logadd(temp, stemp, 7.
                             * div - el7, -sdiv);
            sang[i] = signadd(temp, stemp, 7.
                              * div - el7, -sdiv);
            lang[i] = logadd(lang[i], sang[i], 9.
                             * div - el9, sdiv);
            sang[i] = signadd(lang[i], sang[i], 9.
                              * div - el9, sdiv);
        }

        else if (sigma >= 1.) {
            FXPang[i] = round(round(round((round(1. / round(-sigma,
                Quant), Quant) + round(round(pe, Quant) / 2., Quant)
                + round(1. / round(round(pow(sigma, 3.), Quant)
                * 3., Quant), Quant) - round(1. / round(round(pow(
                sigma, 5.), Quant) * 5., Quant), Quant) + round(1.
                / round(round(pow(sigma, 7.), Quant) * 7., Quant),
                Quant) - round(1. / round(round(pow(sigma, 9.),
                Quant) * 9., Quant), Quant)), Quant) * 180.,
                Quant) / pe, Quant);
            lang[i] = logadd(-div, -sdiv, -3.
                             * div - el3, sdiv);
            sang[i] = signadd(-div, -sdiv, -3.
                              * div - el3, sdiv);
            temp = logadd(lang[i], sang[i], -5.
                          * div - el5, -sdiv);
            stemp = signadd(lang[i], sang[i], -5.
                            * div - el5, -sdiv);
            lang[i] = logadd(temp, stemp, -7.
                             * div - el7, sdiv);
            sang[i] = signadd(temp, stemp, -7.
                              * div - el7, sdiv);
            lang[i] = logadd(lang[i], sang[i], -9.
                             * div - el9, -sdiv);
            sang[i] = signadd(lang[i], sang[i], -9.
                              * div - el9, -sdiv);
            lang[i] = logadd(lang[i], sang[i], elpe2, 1);
            sang[i] = signadd(lang[i], sang[i], elpe2, 1);
        }


        else if (sigma <= -1.) {
            FXPang[i] = round(round(round((round(1. / round(-sigma,
                Quant), Quant) - round(round(pe, Quant) / 2., Quant)
                + round(1. / round(round(pow(sigma, 3.), Quant) * 3.,
                Quant), Quant) - round(1. / round(round(pow(sigma, 5.
                ), Quant) * 5., Quant), Quant) + round(1. / round(
                round(pow(sigma, 7.), Quant) * 7., Quant), Quant) -
                round(1. / round(round(pow(sigma, 9.), Quant) * 9.,
                Quant), Quant)), Quant) * 180., Quant) / pe, Quant);
            lang[i] = logadd(-div, -sdiv, -3.
                             * div - el3, sdiv);
            sang[i] = signadd(-div, -sdiv, -3.
                              * div - el3, sdiv);
            temp = logadd(lang[i], sang[i], -5.
                          * div - el5, -sdiv);
            stemp = signadd(lang[i], sang[i], -5.
                            * div - el5, -sdiv);
            lang[i] = logadd(temp, stemp, -7. * div - el7, sdiv);
            sang[i] = signadd(temp, stemp, -7. * div - el7, sdiv);
            lang[i] = logadd(lang[i], sang[i], -9.
                             * div - el9, -sdiv);
            sang[i] = signadd(lang[i], sang[i], -9.
                              * div - el9, -sdiv);
            lang[i] = logadd(lang[i], sang[i], elpe2, -1);
            sang[i] = signadd(lang[i], sang[i], elpe2, -1);
        }
        logang[i] = ICC(lang[i], sang[i]) * 180. / pe;
        logang[i] -= floor(logang[i] / 90.) * 90.;

        /* CORDIC STARTS HERE */
        Ycord[1] = trunc(carty[i], QT);
        Xcord[1] = trunc(cartx[i], QT);
        if (Ycord[1] == 0.)
            corangle[i] = 0.0;
        else if (Ycord[1] > 0.) {
            Ycord[2] = -trunc(Xcord[1], QT);
            Xcord[2] = trunc(Ycord[1], QT);
            angle[2] = 90.;
        }
        else if (Ycord[1] < 0.) {
            Ycord[2] = trunc(Xcord[1], QT);
            Xcord[2] = -trunc(Ycord[1], QT);
            angle[2] = -90.;
        }
        if (Ycord[2] == 0.)
            corangle[i] = 0.0;
        else {
            j = 3;
            while ((j <= Cordic_Iter) && (Ycord[j - 1] != 0.0)) {
                Ycord[j] = trunc(Ycord[j - 1] - sign(Ycord[j - 1])
                                 * Xcord[j - 1] * pow(2., -(j - 3.)), QT);
                Xcord[j] = trunc(Xcord[j - 1] + sign(Ycord[j - 1])
                                 * Ycord[j - 1] * pow(2., -(j - 3.)), QT);
                angle[j] = trunc(angle[j - 1] + sign(Ycord[j - 1])
                                 * ATP[j], QT);
                j++;
            }
            corangle[i] = angle[j - 1];
            corrad[i] = Xcord[j - 1] / const;
        }
        printf("%d FLP %f %f %f %f\n", i,
               flprad[i], flpang[i], carty[i], cartx[i]);
        printf("%d FXP %f %f\n", i, FXPrad[i], FXPang[i]);
        printf("%d log %f %f\n", i,
               lrad[i], logang[i] - floor(logang[i] / 90.) * 90.);
        printf("%d COR %f %f CONST = %f\n", i,
               corrad[i], corangle[i], Xcord[j - 1]);
    }

    TFLPrad = 0.0;
    TFLP = 0.0;
    TFXPrad = 0.0;
    TFXP = 0.0;
    Tlograd = 0.0;
    Tlog = 0.0;
    TCORrad = 0.0;
    TCOR = 0.0;
    for (i = 1; i < (90. / step); i++) {
        if (fabs(logang[i] - 45.) > 7.) {
            TFLP += pow(flpang[i], 2.);
            TFXP += pow(FXPang[i] - flpang[i], 2.);
            Tlog += pow(logang[i] - flpang[i], 2.);
            TCOR += pow(corangle[i] - flpang[i], 2.);
            TFLPrad += pow(flprad[i], 2.);
            TFXPrad += pow(FXPrad[i] - flprad[i], 2.);
            Tlograd += pow(lrad[i] - flprad[i], 2.);
            TCORrad += pow(corrad[i] - flprad[i], 2.);
        }
    }
    TFXP = sqrt(TFXP / TFLP);
    Tlog = sqrt(Tlog / TFLP);
    TCOR = sqrt(TCOR / TFLP);
    TFXPrad = sqrt(TFXPrad / TFLPrad);
    Tlograd = sqrt(Tlograd / TFLPrad);
    TCORrad = sqrt(TCORrad / TFLPrad);
    printf("*** ERRORS :\nFXP = %f Tlog = %f TCOR = %f\n",
           TFXP, Tlog, TCOR);
    printf(" \nFXPrad = %f Tlograd = %f TCORrad = %f\n",
           TFXPrad, Tlograd, TCORrad);
}


double round(arg, Q)
double arg, Q;
{
    double res;

    res = floor(0.5 + arg * Q) / Q;
    return (res);
}

double trunc(arg, Q)
double arg, Q;
{
    double res;

    res = floor(arg * Q) / Q;
    return (res);
}


double logadd(arg1, s1, arg2, s2)
/* Produces the LNS magnitude exponent of the
   addition of two numbers */
double arg1, arg2;
int s1, s2;
{
    double res, temp;
    double round(), addmap(), submap();

    /* Round the address v to 8 bits */
    temp = round(arg1 - arg2, pow(2., 8.));
    if (s1 == 0)
        res = arg2;
    else if (s2 == 0)
        res = arg1;
    else if (s2 == s1) {
        if (temp >= 0.)
            res = arg1 + addmap(temp);
        else
            res = arg2 + addmap(-temp);
    }
    else {
        if (temp > 0.)
            res = arg1 + submap(temp);
        else if (temp == 0.)
            res = 0.;  /* sign is also set to 0 */
        else
            res = arg2 + submap(-temp);
    }
    return (res);
}

double addmap(arg)
/* Produces the unrounded logarithmic mapping corresponding
   to the addition of two numbers (given by Phi(v)) */
double arg;
{
    extern double rad;
    double res;

    res = log(1. + pow(rad, -arg)) / log(rad);
    return (res);
}

double submap(arg)
/* Produces the logarithmic mapping corresponding to the
   subtraction of two numbers (given by Psi(v)) */
double arg;
{
    extern double rad;
    double res;

    res = log(1. - pow(rad, -arg)) / log(rad);
    return (res);
}

signadd(arg1, s1, arg2, s2)
/* Determines the first sign of the LNS exponent of the
   result of the addition (subtraction) of two numbers */
double arg1, arg2;
int s1, s2;
{
    int sign;
    double temp;

    if (s1 == 0)
        sign = s2;
    else if (s2 == 0)
        sign = s1;
    else if (s2 == s1)
        sign = s1;
    else {
        temp = arg1 - arg2;
        if (temp > 0.)
            sign = s1;
        else if (temp == 0.)
            sign = 0;
        else
            sign = s2;
    }
    return (sign);
}


sign(arg)  /* Determines the
              first LNS sign (of the real number) */
double arg;
{
    int sign;

    if (arg > pow(rad, lowexp))
        sign = 1;
    else if (fabs(arg) < pow(rad, lowexp))
        sign = 0;
    else
        sign = -1;
    return (sign);
}


double FCC(arg)
/* Returns the LNS exponent of the absolute value
   (not rounded) */
double arg;
{
    extern double lowexp, rad;
    double res;

    if (fabs(arg) >= pow(rad, lowexp))
        res = log(fabs(arg)) / log(rad);
    else
        res = 0.0;
    return (res);
}

double ICC(arg, sign)
/* Performs the conversion from LNS to FLP */
double arg;
int sign;
{
    extern double rad;
    double res;

    res = pow(rad, arg) * sign;
    return (res);
}


APPENDIX O

CODE SIMULATING AND ANALYZING THE FLP, FXP AND LNS
VERSIONS OF ECHO CANCELLERS


/***********************************************************
*                                                          *
* This program simulates an echo canceller implemented     *
* in the Floating Number System.                           *
* Passes its output to the file 'outflp'.                  *
*                                                          *
* The parameters used are:                                 *
* TOTAL : # of numbers (x1 ... xN) <= SIZE = 1000          *
* Ensemble : # of times that echo cancelling               *
*    is repeated.                                          *
* power of 2 : For power = 3 the normally distributed      *
*    numbers extend from -8 to 8.                          *
* K, Alpha : Loop gain and Alpha shown below               *
*    (0.0125, .3)                                          *
* Taps : Number of weight coefficients                     *
* Delay : y[k] = A * x[k - delay]                          *
*                                                          *
***********************************************************/


#include <math.h>
#include <stdio.h>

#define SIZE 1000
main()
{
    int t, i, k, delay, j, N, total, sum;
    int Ensemble, low, high;
    double var, mea, ave[SIZE];
    double limit, alpha, gauss_ran, mean, variance;
    double QF, K, Alpha;
    double y[SIZE], y_hat[SIZE];
    double x[SIZE], e[25][SIZE], h[256][SIZE], mse;
    FILE *fp, *fopen(), *fclose();

    fp = fopen("outflp", "w");
    printf(" TOTAL # of numbers (10000 ?), Ensemble size (3 ?) :\n");
    scanf("%d%d", &total, &Ensemble);
    printf(" Enter power of 2 for dynamic range(limit) (4 ?): \n");
    scanf("%lf", &limit);   /* limit is a double */
    printf(" Enter total # of TAPS (256 ?), DELAY (5 ?): \n");
    scanf("%d%d", &N, &delay);
    low = 0;
    high = 10000;
    alpha = 6. / pow(2., limit);
    sum = pow(pow(2., limit), 2.) / 3.;
    mean = (alpha / 2.) * sum;
    variance = (sum * pow(alpha, 2.)) / 12.;
    printf(" Enter Loop gain, Alpha: \n");
    scanf("%lf%lf", &K, &Alpha);   /* K and Alpha are doubles */
    for (t = 1; t <= Ensemble; t++) {
        mse = 0.;
        /* Generation of normally distributed numbers */
        for (i = 0; i < total + N; i++)
            x[i] = 0.;
        for (i = 0; i < total - 1; i++) {
            gauss_ran = 0.;
            for (j = 1; j <= sum; j++)
                /* sum = 12 for : (sigma, a = 1) */
                gauss_ran += alpha * (double)
                    (nfrom(low, high) / (double) high);
            x[i] = gauss_ran - mean;
        }
        for (i = 0; i <= N - 1; i++)
            for (k = 0; k <= total; k++)
                h[i][k] = 0;
        /* LMS Algorithm for Echo cancellation */
        for (k = delay; k <= delay + total - 1; k++) {
            y[k] = Alpha * x[k - delay];
            /* y[k]=A * x[k-delay] */
            y_hat[k] = 0.;
            for (i = 0; i <= N - 1; i++) {
                if (k >= i)
                    y_hat[k] += x[k - i] * h[i][k];
            }
            e[t][k] = y[k] - y_hat[k];
            for (i = 0; i <= N - 1; i++) {
                if (k >= i)
                    h[i][k + 1] = h[i][k] + 2. * K * x[k - i] * e[t][k];
                else
                    h[i][k + 1] = h[i][k];
            }
            mse += pow(e[t][k], 2.);
        }
        mse /= total;
        printf(" MSE = %f\n", mse);
    }
    for (k = delay; k <= delay + total - 1; k++) {
        ave[k] = 0.;
        for (t = 1; t <= Ensemble; t++) {
            ave[k] += pow(e[t][k], 2.);
        }
        ave[k] /= Ensemble;
        printf(" ave[%d] = %f\n", k, ave[k]);
        fprintf(fp, "%d %f\n", k - delay, ave[k]);
    }
    fclose(fp);
}
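
The core of the floating-point canceller is the LMS recursion: estimate the echo as y_hat[k] = sum of h[i] * x[k - i], form the error e = y - y_hat, and update each weight by 2 * K * x[k - i] * e. The sketch below reduces it to a single weight vector updated in place. Everything beyond that recursion is an assumption made for illustration: the name `lms_identify`, the sizes `TAPS` and `NSAMP`, and the small linear congruential generator that stands in for the listing's Gaussian input (the listing sums `nfrom()` draws instead).

```c
#include <assert.h>
#include <math.h>

#define TAPS  8      /* hypothetical filter length */
#define NSAMP 2000   /* hypothetical number of samples */

/* Illustrative pseudo-random source, uniform in [-1, 1). */
static unsigned long lcg_state = 12345UL;
static double uniform_pm1(void)
{
    lcg_state = (lcg_state * 1103515245UL + 12345UL) & 0x7fffffffUL;
    return (double) lcg_state / 1073741824.0 - 1.0;
}

/* Identify the echo path y[k] = A * x[k - delay] with LMS.
   Returns the final weight at index `delay`; *final_mse receives the
   mean squared error over the last 100 samples. */
double lms_identify(double A, int delay, double mu, double *final_mse)
{
    static double x[NSAMP];
    double h[TAPS] = {0};
    double mse = 0.0;
    int k, i;

    assert(delay >= 0 && delay < TAPS && mu > 0.0 && final_mse != 0);
    for (k = 0; k < NSAMP; k++)
        x[k] = uniform_pm1();
    for (k = 0; k < NSAMP; k++) {
        double y = (k >= delay) ? A * x[k - delay] : 0.0;
        double y_hat = 0.0, e;

        for (i = 0; i < TAPS && i <= k; i++)
            y_hat += h[i] * x[k - i];
        e = y - y_hat;
        for (i = 0; i < TAPS && i <= k; i++)
            h[i] += 2.0 * mu * e * x[k - i];   /* LMS weight update */
        if (k >= NSAMP - 100)
            mse += e * e;
    }
    *final_mse = mse / 100.0;
    return h[delay];
}
```

With A = 0.3, delay = 3 and mu = 0.05 the weight at index 3 should settle near 0.3 while the remaining weights decay toward zero. Keeping only the current weight vector also avoids the h[i][k] history array that the listing carries for every time step.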


#include <math.h>
#include <stdio.h>

#define SIZE 1000

main()
{
    int t, i, k, delay, j, N, Qfix, total, sum;
    int Ensemble, low, high;
    double var, mea, rave[SIZE];
    double limit, alpha, gauss_ran, mean, variance;
    double QF, K, Alpha;
    double round(), trunc();
    double y[SIZE], e[25][SIZE], h[256][SIZE];
    double y_hat[SIZE], x[SIZE], roundmse;
    FILE *fp, *fopen(), *fclose();

    fp = fopen("outfxp", "w");
    printf(" TOTAL # of numbers (10000 ?), Ensemble size (3 ?) :\n");
    scanf("%d%d", &total, &Ensemble);
    printf(" Enter power of 2 for dynamic range(limit) (4 ?): \n");
    scanf("%lf", &limit);   /* limit is a double */
    printf(" Enter total # of TAPS (256 ?), DELAY (5 ?): \n");
    scanf("%d%d", &N, &delay);
    low = 0;
    high = 10000;
    alpha = 6. / pow(2., limit);
    sum = pow(pow(2., limit), 2.) / 3.;
    mean = (alpha / 2.) * sum;
    variance = (sum * pow(alpha, 2.)) / 12.;
    printf(" Enter Loop gain, Alpha, Qfix: \n");
    scanf("%lf%lf%d", &K, &Alpha, &Qfix);   /* K and Alpha are doubles */
    QF = pow(2., (double) (Qfix - limit - 1));

    /* 1 bit is for sign */
    for (t = 1; t <= Ensemble; t++) {
        roundmse = 0.;
        /* Generation of normally distributed numbers */
        for (i = 0; i < total + N; i++) {
            x[i] = 0.;
        }
        for (i = 0; i < total - 1; i++) {
            gauss_ran = 0.;
            for (j = 1; j <= sum; j++)
                /* sum = 12 for : (sigma, a = 1) */
                gauss_ran += alpha * (double)
                    (nfrom(low, high) / (double) high);
            x[i] = gauss_ran - mean;
        }
        for (i = 0; i <= N - 1; i++)
            for (k = 0; k <= total; k++)
                h[i][k] = 0;
        /* Echo canceller for fixed point version */
        for (k = delay; k <= delay + total - 1; k++) {
            y[k] = round(round(Alpha, QF) *
                         round(x[k - delay], QF), QF);
            /* y[k]=A * x[k-delay] */
            y_hat[k] = 0.;
            for (i = 0; i <= N - 1; i++) {
                if (k >= i)
                    y_hat[k] += round(round(x[k - i], QF) *
                                      round(h[i][k], QF), 35.);
                /* 35 are the bits which are available
                   to the accumulator */
            }
            e[t][k] = y[k] - y_hat[k];
            for (i = 0; i <= N - 1; i++) {
                if (k >= i)
                    h[i][k + 1] = round(round(h[i][k], QF) +
                                  round(2. * K * round(x[k - i], QF) *
                                  round(e[t][k], QF), QF), QF);
                else
                    h[i][k + 1] = round(h[i][k], QF);
            }
            roundmse += pow(e[t][k], 2.);
        }
        roundmse /= total;
        printf(" Fixed point MSE = %f\n", roundmse);
    }
    for (k = delay; k <= delay + total - 1; k++) {
        rave[k] = 0.;
        for (t = 1; t <= Ensemble; t++) {
            rave[k] += pow(e[t][k], 2.);
        }
        rave[k] /= Ensemble;
        printf(" rave[%d] = %f\n", k, rave[k]);
        fprintf(fp, "%d %f\n", k - delay, rave[k]);
    }
    fclose(fp);
}
/***********************************************************
*                                                          *
*                      LNS CANCELLER                       *
*                                                          *
***********************************************************/


#include <math.h>
#include <stdio.h>

#define SIZE 1500
double rad, lowexp;
int Underflow;
main()
{
    extern int Underflow;
    double var, mea;
    double limit, alpha, gauss_ran, mean, variance;
    double K, Alpha, LAlpha, logadd(), logmul();
    double round(), lx[SIZE], le[SIZE], trunc(), QL, QF;
    double y[SIZE], y_hat[SIZE], h[50][SIZE];
    double ly[SIZE], ly_hat[SIZE], lh[50][SIZE];
    /* the first dimension of e is illegible in the source; 25 is
       assumed here, as in the FLP and FXP versions */
    double x[SIZE], e[25][SIZE], FCC(), Int, ICC();
    double ltemp, lave[SIZE], l2K, logmse;
    int sy[SIZE], se[SIZE], sh[50][SIZE];
    int stemp, s2K, Ensemble, SAlpha, sign(), signmul();
    int sy_hat[SIZE], sx[SIZE], Qfix, Qlog, signadd();
    int t, i, k, delay, j, N, total, sum, low, high;
    FILE *fp, *fopen(), *fclose();

    fp = fopen("outlogl", "w");
    rad = 2.;
    printf(" TOTAL # of numbers (10000 ?), Ensemble size (3 ?):\n");
    scanf("%d%d", &total, &Ensemble);
    printf("Enter power of 2 for dynamic range(limit) (4 ?) \n");
    printf(" and # of INT bits for logarithmic system (9 ?): \n");
    scanf("%lf%lf", &limit, &Int);   /* both are doubles */
    lowexp = -pow(2., Int);
    printf(" Enter total # of TAPS (50 ?), DELAY (5 ?): \n");
    scanf("%d%d", &N, &delay);
    printf(" Enter Loop gain, Alpha, Qfix, Qlog \n");
    scanf("%lf%lf%d%d", &K, &Alpha, &Qfix, &Qlog);
    QF = pow(2., (double) (Qfix - limit - 1));
    /* 1 bit is for sign */
    QL = pow(2., (double) (Qlog - 2 - Int));
    /* 2 bits are for signs */
    low = 0;
    high = 10000;
    alpha = 6. / pow(2., limit);
    sum = pow(pow(2., limit), 2.) / 3.;
    mean = (alpha / 2.) * sum;
    variance = (sum * pow(alpha, 2.)) / 12.;
    Underflow = 0;
    LAlpha = round(FCC(Alpha), 12.);
    SAlpha = sign(Alpha);
    l2K = round(logmul(round(FCC(2.), 12.), sign(2.),
                       round(FCC(K), 12.), sign(K)), QL);
    s2K = signmul(round(FCC(2.), 12.), sign(2.),
                  round(FCC(K), 12.), sign(K));
    for (t = 1; t <= Ensemble; t++) {
        logmse = 0.;
        /* Generation of normally distributed numbers */
        for (i = 0; i < total + N; i++) {
            lx[i] = 0.;
            sx[i] = 0;
        }
        for (i = 0; i < total - 1; i++) {
            gauss_ran = 0.;
            for (j = 1; j <= sum; j++)
                /* sum = 12 for : (sigma, a = 1) */
                gauss_ran += alpha *
                    (double) (nfrom(low, high) / (double) high);
            x[i] = gauss_ran - mean;
            lx[i] = round(FCC(x[i]), 12.);
            sx[i] = sign(x[i]);
        }
        for (i = 0; i <= N - 1; i++) {
            for (k = 0; k <= total; k++) {
                lh[i][k] = 0.;
                sh[i][k] = 0;
            }
        }
        /* Echo canceller for Logarithmic version */
        for (k = delay; k <= delay + total - 1; k++) {
            ly[k] = round(logmul(LAlpha, SAlpha,
                                 lx[k - delay], sx[k - delay]), QL);
            sy[k] = signmul(LAlpha, SAlpha,
                            lx[k - delay], sx[k - delay]);
            /* y[k]=A * x[k-delay] */
            ly_hat[k] = 0.;
            sy_hat[k] = 0;
            for (i = 0; i <= N - 1; i++) {
                if (k >= i) {
                    ltemp = round(logmul(lx[k - i], sx[k - i],
                                         lh[i][k], sh[i][k]), QL);
                    stemp = signmul(lx[k - i], sx[k - i],
                                    lh[i][k], sh[i][k]);
                    ly_hat[k] = round(logadd(ly_hat[k], sy_hat[k],
                                             ltemp, stemp), QL);
                    sy_hat[k] = signadd(ly_hat[k],
                                        sy_hat[k], ltemp, stemp);
                }
            }
            le[k] = round(logadd(ly[k], sy[k],
                                 ly_hat[k], -sy_hat[k]), QL);
            se[k] = signadd(ly[k], sy[k],
                            ly_hat[k], -sy_hat[k]);
            for (i = 0; i <= N - 1; i++) {
                if (k >= i) {
                    stemp = signmul(lx[k - i],
                                    sx[k - i], le[k], se[k]);
                    ltemp = round(logmul(lx[k - i], sx[k - i],
                                         le[k], se[k]), QL);
                    stemp = signmul(l2K, s2K, ltemp, stemp);
                    ltemp = round(logmul(l2K, s2K,
                                         ltemp, stemp), QL);
                    lh[i][k + 1] = round(logadd(lh[i][k], sh[i][k],
                                                ltemp, stemp), QL);
                    sh[i][k + 1] = signadd(lh[i][k], sh[i][k],
                                           ltemp, stemp);
                }
                else {
                    lh[i][k + 1] = lh[i][k];
                    sh[i][k + 1] = sh[i][k];
                }
            }
            e[t][k] = ICC(le[k], se[k]);
            logmse += pow(e[t][k], 2.);
        }
        logmse /= total;
        printf("logarithmic MSE = %f\n", logmse);
    }
    for (k = delay; k <= delay + total - 1; k++) {
        lave[k] = 0.;
        for (t = 1; t <= Ensemble; t++) {
            lave[k] += pow(e[t][k], 2.);
        }
        lave[k] /= Ensemble;
        fprintf(fp, "%d %f\n", k - delay, lave[k]);
    }
    fclose(fp);
    printf(" Underflow = %d\n", Underflow);
}


double logmul(arg1, s1, arg2, s2)
double arg1, arg2;
int s1, s2;
{
    double res;
    extern int Underflow;
    extern double lowexp;

    if ((s1 == 0) || (s2 == 0))
        res = 0.;
    else {
        res = arg1 + arg2;
        if (fabs(res) < pow(rad, lowexp)) {
            res = 0.;
            Underflow++;
        }
    }
    return (res);
}

signmul(arg1, s1, arg2, s2)
double arg1, arg2;
int s1, s2;
{
    extern double lowexp;
    int sign;

    if ((s1 == 0) || (s2 == 0))
        sign = 0;
    else if (fabs(arg1 + arg2) < pow(rad, lowexp))
        sign = 0;
    else if (s1 == s2)
        sign = 1;
    else
        sign = -1;
    return (sign);
}